CombineGVCFs on chr6,7,8,9,X (hg38) running forever


7 comments

  • SkyWarrior

    Hi XiuChao Wang

    Can you elaborate a little more on your compute environment? 

    Also, although CombineGVCFs is useful, combining this many samples with it is not the recommended way to run joint genotyping. Instead, the GATK team recommends the GenomicsDBImport tool, which is much faster (it can load samples in batches and read GVCFs with multiple threads) and more robust for large sample counts. A minimal sketch follows.
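
    For illustration only (the file names, workspace path, and thread count below are placeholders, not your actual setup), a per-contig import might look roughly like this:

    # Hypothetical sketch: load per-sample GVCFs for one contig into a
    # GenomicsDB workspace instead of merging them with CombineGVCFs.
    gatk --java-options "-Xmx40g" GenomicsDBImport \
        -R hg38.fa \
        --variant sampleA.HC.chr7.g.vcf.gz \
        --variant sampleB.HC.chr7.g.vcf.gz \
        --genomicsdb-workspace-path chr7_gdb \
        --reader-threads 4 \
        -L chr7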

    Regards. 

  • XiuChao Wang

    “Can you elaborate a little more on your compute environment?”:

    The HaplotypeCaller command, run per chromosome, is shown below.

    chrom=( chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY)

    for i in ${chrom[@]};do
    {
       java -jar $gatk HaplotypeCaller \
       -ERC GVCF \
       -R hg38.fa \
       -I ${sample}.recal.bam \
       -L ${i} \
       -O ${sample}.HC.${i}.g.vcf.gz
    }
    done

    The CombineGVCFs and GenotypeGVCFs commands, run per chromosome, are shown below:

    for i in ${chrom[@]};do
            {
                sample_gvcfs=""
                for sample in `cat ${sample_list}`;do
                    sample_gvcfs=${sample_gvcfs}"--variant ${sample}.HC.${i}.g.vcf.gz "
                done
                
                $gatk --java-options "-Xmx40g -XX:ParallelGCThreads=4" CombineGVCFs \
                    -R hg38.fa \
                    ${sample_gvcfs} \
                    -O all.sample.HC.${i}.merge.g.vcf.gz

                $gatk --java-options "-Xmx40g" GenotypeGVCFs \
                    -R hg38.fa \
                    -V all.sample.HC.${i}.merge.g.vcf.gz \
                    -O all.sample.HC.${i}.merge.vcf.gz
            }
    done

     

    “... recommends GenomicsDBImport tool ...”

    I used chr7 as a test:

    chrom=( chr7 )
    for i in ${chrom[@]}; do
    {
        sample_gvcfs=""
        for sample in `cat ${sample_list}`; do
            sample_gvcfs=${sample_gvcfs}"--variant ${sample}.HC.${i}.g.vcf.gz "
        done

        $gatk --java-options "-Xmx40g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
            -R hg38.fa \
            ${sample_gvcfs} \
            --genomicsdb-workspace-path chrom_merge_vcf_5 \
            --tmp-dir tmp \
            --max-num-intervals-to-import-in-parallel 32 \
            --genomicsdb-shared-posixfs-optimizations true \
            --reader-threads 10 \
            --intervals chr7
    }
    done
        

    The GenomicsDBImport process has been running for ~51 hours and is still not finished.

    Entire program log: below is everything the job has printed after ~51 hours:

    17:03:59.676 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/public/home/bioinfo_wang/00_software/miniconda3/envs/bioinfo_wang/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    17:04:00.128 INFO  GenomicsDBImport - ------------------------------------------------------------
    17:04:00.129 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1
    17:04:00.129 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    17:04:00.129 INFO  GenomicsDBImport - Executing as bioinfo_wang@sir1c6u5-comput116 on Linux v3.10.0-957.el7.x86_64 amd64
    17:04:00.129 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
    17:04:00.129 INFO  GenomicsDBImport - Start Date/Time: February 17, 2024 5:03:59 PM CST
    17:04:00.130 INFO  GenomicsDBImport - ------------------------------------------------------------
    17:04:00.130 INFO  GenomicsDBImport - ------------------------------------------------------------
    17:04:00.130 INFO  GenomicsDBImport - HTSJDK Version: 2.24.1
    17:04:00.130 INFO  GenomicsDBImport - Picard Version: 2.27.1
    17:04:00.131 INFO  GenomicsDBImport - Built for Spark Version: 2.4.5
    17:04:00.131 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    17:04:00.131 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    17:04:00.131 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    17:04:00.131 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    17:04:00.131 INFO  GenomicsDBImport - Deflater: IntelDeflater
    17:04:00.131 INFO  GenomicsDBImport - Inflater: IntelInflater
    17:04:00.131 INFO  GenomicsDBImport - GCS max retries/reopens: 20
    17:04:00.131 INFO  GenomicsDBImport - Requester pays: disabled
    17:04:00.132 INFO  GenomicsDBImport - Initializing engine
    17:04:32.919 WARN  IntelInflater - Zero Bytes Written : 0
    17:04:33.053 WARN  IntelInflater - Zero Bytes Written : 0
    17:04:39.260 INFO  IntervalArgumentCollection - Processing 159345973 bp from intervals
    17:04:39.491 INFO  GenomicsDBImport - Done initializing engine
    17:04:40.044 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
    17:04:40.051 INFO  GenomicsDBImport - Vid Map JSON file will be written to /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5/vidmap.json
    17:04:40.052 INFO  GenomicsDBImport - Callset Map JSON file will be written to /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5/callset.json
    17:04:40.052 INFO  GenomicsDBImport - Complete VCF Header will be written to /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5/vcfheader.vcf
    17:04:40.053 INFO  GenomicsDBImport - Importing to workspace - /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5
    17:04:54.129 WARN  IntelInflater - Zero Bytes Written : 0
    17:04:54.151 WARN  IntelInflater - Zero Bytes Written : 0
    17:04:54.407 INFO  GenomicsDBImport - Importing batch 1 with 635 samples

     

    The output directory chrom_merge_vcf_5:

    [wang@21:04:21 chrom_merge_vcf_5]$ll
    total 148K
    drwx------ 4 wang acksd2rqg4 8.0K Feb 17 17:05 chr7$1$159345973
    -rwx------ 1 wang acksd2rqg4    0 Feb 17 17:04 __tiledb_workspace.tdb
    -rwx------ 1 wang acksd2rqg4 100K Feb 17 17:04 vcfheader.vcf
    -rwx------ 1 wang acksd2rqg4  40K Feb 17 17:04 vidmap.json

    [wang@21:15:05 chr7$1$159345973]$ls -al
    total 33
    drwx------ 4 wang acksd2rqg4 8192 Feb 17 17:05 .
    drwx------ 3 wang acksd2rqg4 8192 Feb 17 17:05 ..
    drwx------ 2 wang acksd2rqg4 8192 Feb 17 17:05 .__9972363a-2dc0-4b59-873d-1e60d362ef1247323956225792_1708160735373
    -rwx------ 1 wang acksd2rqg4  592 Feb 17 17:05 __array_schema.tdb
    -rwx------ 1 wang acksd2rqg4    0 Feb 17 17:05 .__consolidation_lock
    drwx------ 2 wang acksd2rqg4 8192 Feb 17 17:05 genomicsdb_meta_dir

     

    [wang@21:16:20 .__9972363a-2dc0-4b59-873d-1e60d362ef1247323956225792_1708160735373]$ll
    total 111G
    -rwx------ 1 wang acksd2rqg4 586M Feb 19 21:16 AD.tdb
    -rwx------ 1 wang acksd2rqg4 262M Feb 19 21:16 AD_var.tdb
    -rwx------ 1 wang acksd2rqg4 706M Feb 19 21:16 ALT.tdb
    -rwx------ 1 wang acksd2rqg4 303M Feb 19 21:16 ALT_var.tdb
    -rwx------ 1 wang acksd2rqg4 652M Feb 19 21:16 BaseQRankSum.tdb
    -rwx------ 1 wang acksd2rqg4  15G Feb 19 21:16 __coords.tdb
    -rwx------ 1 wang acksd2rqg4 9.7G Feb 19 21:16 DP_FORMAT.tdb
    -rwx------ 1 wang acksd2rqg4 570M Feb 19 21:16 DP.tdb
    -rwx------ 1 wang acksd2rqg4  17G Feb 19 21:16 END.tdb
    -rwx------ 1 wang acksd2rqg4 490M Feb 19 21:16 ExcessHet.tdb
    -rwx------ 1 wang acksd2rqg4 360M Feb 19 21:16 FILTER.tdb
    -rwx------ 1 wang acksd2rqg4  11G Feb 19 21:16 GQ.tdb
    -rwx------ 1 wang acksd2rqg4 614M Feb 19 21:16 GT.tdb
    -rwx------ 1 wang acksd2rqg4 581M Feb 19 21:16 GT_var.tdb
    -rwx------ 1 wang acksd2rqg4 360M Feb 19 21:16 ID.tdb
    -rwx------ 1 wang acksd2rqg4 385M Feb 19 21:16 InbreedingCoeff.tdb
    -rwx------ 1 wang acksd2rqg4 9.7G Feb 19 21:16 MIN_DP.tdb
    -rwx------ 1 wang acksd2rqg4 584M Feb 19 21:16 MLEAC.tdb
    -rwx------ 1 wang acksd2rqg4 113M Feb 19 21:16 MLEAC_var.tdb
    -rwx------ 1 wang acksd2rqg4 584M Feb 19 21:16 MLEAF.tdb
    -rwx------ 1 wang acksd2rqg4 129M Feb 19 21:16 MLEAF_var.tdb
    -rwx------ 1 wang acksd2rqg4 501M Feb 19 21:16 MQRankSum.tdb
    -rwx------ 1 wang acksd2rqg4 463M Feb 19 21:16 PGT.tdb
    -rwx------ 1 wang acksd2rqg4  11M Feb 19 21:16 PGT_var.tdb
    -rwx------ 1 wang acksd2rqg4 476M Feb 19 21:16 PID.tdb
    -rwx------ 1 wang acksd2rqg4  21M Feb 19 21:16 PID_var.tdb
    -rwx------ 1 wang acksd2rqg4 731M Feb 19 21:16 PL.tdb
    -rwx------ 1 wang acksd2rqg4  34G Feb 19 21:16 PL_var.tdb
    -rwx------ 1 wang acksd2rqg4 392M Feb 19 21:16 PS.tdb
    -rwx------ 1 wang acksd2rqg4 779M Feb 19 21:16 QUAL.tdb
    -rwx------ 1 wang acksd2rqg4 735M Feb 19 21:16 RAW_MQandDP.tdb
    -rwx------ 1 wang acksd2rqg4 660M Feb 19 21:16 ReadPosRankSum.tdb
    -rwx------ 1 wang acksd2rqg4 612M Feb 19 21:16 REF.tdb
    -rwx------ 1 wang acksd2rqg4 3.6G Feb 19 21:16 REF_var.tdb
    -rwx------ 1 wang acksd2rqg4 840M Feb 19 21:16 SB.tdb

    [wang@21:17:10 genomicsdb_meta_dir]$ll
    total 1.0K
    -rwx------ 1 wang acksd2rqg4 62 Feb 17 17:05 genomicsdb_column_bounds.json
    -rwx------ 1 wang acksd2rqg4 62 Feb 17 17:05 genomicsdb_meta_8a9b100c-753b-4793-a000-3918c7f4c3b3.json

     

    What should the resulting GenomicsDBImport output include? Roughly how long should this process take?

  • XiuChao Wang

    The full tree of the output directory chrom_merge_vcf_5:

    [wang@21:32:28 chrom_merge_vcf_5]$tree -a
    .
    ├── chr7$1$159345973
    │   ├── .__9972363a-2dc0-4b59-873d-1e60d362ef1247323956225792_1708160735373
    │   │   ├── AD.tdb
    │   │   ├── AD_var.tdb
    │   │   ├── ALT.tdb
    │   │   ├── ALT_var.tdb
    │   │   ├── BaseQRankSum.tdb
    │   │   ├── __coords.tdb
    │   │   ├── DP_FORMAT.tdb
    │   │   ├── DP.tdb
    │   │   ├── END.tdb
    │   │   ├── ExcessHet.tdb
    │   │   ├── FILTER.tdb
    │   │   ├── GQ.tdb
    │   │   ├── GT.tdb
    │   │   ├── GT_var.tdb
    │   │   ├── ID.tdb
    │   │   ├── InbreedingCoeff.tdb
    │   │   ├── MIN_DP.tdb
    │   │   ├── MLEAC.tdb
    │   │   ├── MLEAC_var.tdb
    │   │   ├── MLEAF.tdb
    │   │   ├── MLEAF_var.tdb
    │   │   ├── MQRankSum.tdb
    │   │   ├── PGT.tdb
    │   │   ├── PGT_var.tdb
    │   │   ├── PID.tdb
    │   │   ├── PID_var.tdb
    │   │   ├── PL.tdb
    │   │   ├── PL_var.tdb
    │   │   ├── PS.tdb
    │   │   ├── QUAL.tdb
    │   │   ├── RAW_MQandDP.tdb
    │   │   ├── ReadPosRankSum.tdb
    │   │   ├── REF.tdb
    │   │   ├── REF_var.tdb
    │   │   └── SB.tdb
    │   ├── __array_schema.tdb
    │   ├── .__consolidation_lock
    │   └── genomicsdb_meta_dir
    │       ├── genomicsdb_column_bounds.json
    │       └── genomicsdb_meta_8a9b100c-753b-4793-a000-3918c7f4c3b3.json
    ├── __tiledb_workspace.tdb
    ├── vcfheader.vcf
    └── vidmap.json

  • SkyWarrior

    Hi again.

    Both tools will struggle if they try to load all samples at once. GenomicsDBImport can load samples into the GenomicsDB workspace in batches, so it consumes less memory and runs faster.

    --batch-size <Integer>

    The parameter above is worth trying since you are already facing this issue; a rough example follows.
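
    As an illustration only (the memory setting, batch size, and workspace name are placeholders to adjust for your run), the flag slots straight into the command you posted:

    # Hypothetical sketch: same chr7 import as above, but committed to the
    # workspace in batches of 100 samples to cap memory use.
    $gatk --java-options "-Xmx40g" GenomicsDBImport \
        -R hg38.fa \
        ${sample_gvcfs} \
        --genomicsdb-workspace-path chrom_merge_vcf_5_batched \
        --batch-size 100 \
        --reader-threads 10 \
        --intervals chr7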

     

    By the way, you don't have to pass every single sample to this tool with its own --variant argument. You can provide a sample map file using the parameter below.

    --sample-name-map <String>
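
    As far as I know the map file must be tab-delimited, one sample per line, with the sample name first and the path to its GVCF second. A made-up example (file and sample names are hypothetical):

    # cohort.chr7.sample_map (columns shown with spaces here, but must be separated by a single TAB)
    sampleA    sampleA.HC.chr7.g.vcf.gz
    sampleB    sampleB.HC.chr7.g.vcf.gz

    It would then be passed as --sample-name-map cohort.chr7.sample_map instead of the long list of --variant arguments.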

    I hope this helps. 

  • XiuChao Wang

    Thanks for your reply!

    1. Are the contents of the output folder shown above correct?

    2. "--sample-name-map map.list"   , 

    map.list content like this, is there a problem with this format? 

    CZEY001    CZEY001.HC.chr6.g.vcf.gz
    CZEY002    CZEY002.HC.chr6.g.vcf.gz
    CZEY003    CZEY003.HC.chr6.g.vcf.gz
    CZEY004    CZEY004.HC.chr6.g.vcf.gz
    CZEY005    CZEY005.HC.chr6.g.vcf.gz
    CZEY006    CZEY006.HC.chr6.g.vcf.gz
    CZEY007    CZEY007.HC.chr6.g.vcf.gz
    CZEY008    CZEY008.HC.chr6.g.vcf.gz
    CZEY009    CZEY009.HC.chr6.g.vcf.gz

  • XiuChao Wang

    The updated GenomicsDBImport command is shown below:

    Is this reasonable?

    chrom=( chr6 chr7 chr8 chr9 chrX )
    for i in ${chrom[@]}; do
    {
        $gatk --java-options "-Xmx100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
            -R ${ref} \
            --sample-name-map ${i}.list \
            --genomicsdb-workspace-path chrom_merge_vcf_${i} \
            --tmp-dir tmp \
            --max-num-intervals-to-import-in-parallel 64 \
            --genomicsdb-shared-posixfs-optimizations true \
            --reader-threads 50 \
            --batch-size 100 \
            --intervals ${i}
    }
    done

  • SkyWarrior

    Hi 

    The contents of the GenomicsDB folder look OK, and those commands also look fine. You may want to tweak the number of samples per batch if the problem persists. If that still does not solve it, we may need to look for additional issues with those contigs.
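
    One more thought for when the import finishes: GenotypeGVCFs can read directly from the GenomicsDB workspace, so there is no merged-GVCF step. A rough per-contig sketch (workspace name taken from your command above; the output name is just an example):

    # Hypothetical sketch: joint-genotype straight from the GenomicsDB workspace.
    $gatk --java-options "-Xmx40g" GenotypeGVCFs \
        -R ${ref} \
        -V gendb://chrom_merge_vcf_${i} \
        -O all.sample.HC.${i}.vcf.gz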

    I hope this helps. 

