CombineGVCFs on chr6,7,8,9,X (hg38) running forever~
REQUIRED for all errors and issues:
a) GATK version used: gatk4-4.2.6.1
b) Exact command used (${sample_gvcfs} expands to 630 --variant arguments):
gatk --java-options "-Xmx40g" CombineGVCFs \
-R ${ref} \
${sample_gvcfs} \
-O all.sample.HC.${i}.merge.g.vcf.gz
c) Entire program log: below is the tail of the log after ~70 hrs of the job (chr6):
16:30:00.624 INFO ProgressMeter - chr6:36507319 4223.2 1766373000 418251.5
16:30:13.116 INFO ProgressMeter - chr6:36507362 4223.4 1766375000 418231.4
16:30:23.249 INFO ProgressMeter - chr6:36507397 4223.6 1766377000 418215.1
16:30:37.261 INFO ProgressMeter - chr6:36507445 4223.8 1766380000 418192.7
16:30:52.617 INFO ProgressMeter - chr6:36507498 4224.1 1766383000 418168.1
16:31:07.364 INFO ProgressMeter - chr6:36507549 4224.3 1766386000 418144.5
16:31:22.177 INFO ProgressMeter - chr6:36507600 4224.6 1766389000 418120.7
16:31:37.624 INFO ProgressMeter - chr6:36507653 4224.8 1766392000 418096.0
16:31:48.037 INFO ProgressMeter - chr6:36507689 4225.0 1766394000 418079.3
16:31:58.794 INFO ProgressMeter - chr6:36507726 4225.2 1766396000 418062.0
Q: The issue is that CombineGVCFs generates the combined GVCF for all chromosomes except chr6, 7, 8, 9, and X. For those chromosomes, CombineGVCFs keeps running for 3-4 days and still has not completed.
For chr6, in the first ~20 hours the output GVCF grew to 66 GB; over the next 50 hours it grew by only ~1 GB. The CombineGVCFs log looks normal except that the `ProgressMeter` is quite slow.
For chr7, 8, 9, and X, it produces only ~60 MB of data after ~10 hours.
I have re-run this 2-3 times already and the situation remains the same. Do you have any suggestions?
-
Hi XiuChao Wang
Can you elaborate a little more on your compute environment?
Also, although the CombineGVCFs tool is useful, combining this many samples is not the recommended way to prepare for joint genotyping with it. Instead, the GATK team recommends the GenomicsDBImport tool, which is much faster (it can load samples in batches and read GVCFs with multiple threads) and more robust for large numbers of samples.
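A minimal sketch of what that could look like (the sample map, workspace name, and parameter values below are placeholders, not taken from your setup):
gatk --java-options "-Xmx40g" GenomicsDBImport \
-R hg38.fa \
--sample-name-map cohort.chr6.sample_map \
--genomicsdb-workspace-path genomicsdb_chr6 \
--batch-size 50 \
--reader-threads 5 \
--intervals chr6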
Regards.
-
“Can you elaborate a little more on your compute environment?”:
The HaplotypeCaller command, run per chromosome, is shown below:
chrom=( chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY)
for i in ${chrom[@]};do
{
java -jar $gatk HaplotypeCaller \
-ERC GVCF \
-R hg38.fa \
-I ${sample}.recal.bam \
-L ${i} \
-O ${sample}.HC.${i}.g.vcf.gz
}
done
The CombineGVCFs command, run per chromosome, is shown below:
for i in ${chrom[@]};do
{
sample_gvcfs=""
for sample in `cat ${sample_list}`;do
sample_gvcfs=${sample_gvcfs}"--variant ${sample}.HC.${i}.g.vcf.gz "
done
$gatk --java-options "-Xmx40g -XX:ParallelGCThreads=4" CombineGVCFs \
-R hg38.fa \
${sample_gvcfs} \
-O all.sample.HC.${i}.merge.g.vcf.gz
$gatk --java-options "-Xmx40g" GenotypeGVCFs \
-R hg38.fa \
-V all.sample.HC.${i}.merge.g.vcf.gz \
-O all.sample.HC.${i}.merge.vcf.gz
}
done
“... recommends GenomicsDBImport tool ...”
I used chr7 as a test:
chrom=( chr7 )
for i in ${chrom[@]};do
{
sample_gvcfs=""
for sample in `cat ${sample_list}`;do
sample_gvcfs=${sample_gvcfs}"--variant ${sample}.HC.${i}.g.vcf.gz "
done
$gatk --java-options "-Xmx40g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
-R hg38.fa \
${sample_gvcfs} \
--genomicsdb-workspace-path chrom_merge_vcf_5 \
--tmp-dir tmp \
--max-num-intervals-to-import-in-parallel 32 \
--genomicsdb-shared-posixfs-optimizations true \
--reader-threads 10 \
--intervals chr7
}
done
The GenomicsDBImport process has already been running for ~51 hours and is not finished yet.
Entire program log: below is the full log after ~51 hrs of the job:
17:03:59.676 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/public/home/bioinfo_wang/00_software/miniconda3/envs/bioinfo_wang/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:04:00.128 INFO GenomicsDBImport - ------------------------------------------------------------
17:04:00.129 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1
17:04:00.129 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
17:04:00.129 INFO GenomicsDBImport - Executing as bioinfo_wang@sir1c6u5-comput116 on Linux v3.10.0-957.el7.x86_64 amd64
17:04:00.129 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-b13
17:04:00.129 INFO GenomicsDBImport - Start Date/Time: February 17, 2024 5:03:59 PM CST
17:04:00.130 INFO GenomicsDBImport - ------------------------------------------------------------
17:04:00.130 INFO GenomicsDBImport - ------------------------------------------------------------
17:04:00.130 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
17:04:00.130 INFO GenomicsDBImport - Picard Version: 2.27.1
17:04:00.131 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
17:04:00.131 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:04:00.131 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:04:00.131 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:04:00.131 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:04:00.131 INFO GenomicsDBImport - Deflater: IntelDeflater
17:04:00.131 INFO GenomicsDBImport - Inflater: IntelInflater
17:04:00.131 INFO GenomicsDBImport - GCS max retries/reopens: 20
17:04:00.131 INFO GenomicsDBImport - Requester pays: disabled
17:04:00.132 INFO GenomicsDBImport - Initializing engine
17:04:32.919 WARN IntelInflater - Zero Bytes Written : 0
17:04:33.053 WARN IntelInflater - Zero Bytes Written : 0
17:04:39.260 INFO IntervalArgumentCollection - Processing 159345973 bp from intervals
17:04:39.491 INFO GenomicsDBImport - Done initializing engine
17:04:40.044 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
17:04:40.051 INFO GenomicsDBImport - Vid Map JSON file will be written to /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5/vidmap.json
17:04:40.052 INFO GenomicsDBImport - Callset Map JSON file will be written to /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5/callset.json
17:04:40.052 INFO GenomicsDBImport - Complete VCF Header will be written to /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5/vcfheader.vcf
17:04:40.053 INFO GenomicsDBImport - Importing to workspace - /public/home/bioinfo_wang/02_project/06_ShiFuYou/04_snp/chrom_merge_vcf_5
17:04:54.129 WARN IntelInflater - Zero Bytes Written : 0
17:04:54.151 WARN IntelInflater - Zero Bytes Written : 0
17:04:54.407 INFO GenomicsDBImport - Importing batch 1 with 635 samples
The outdir chrom_merge_vcf_5:
[wang@21:04:21 chrom_merge_vcf_5]$ll
total 148K
drwx------ 4 wang acksd2rqg4 8.0K Feb 17 17:05 chr7$1$159345973
-rwx------ 1 wang acksd2rqg4 0 Feb 17 17:04 __tiledb_workspace.tdb
-rwx------ 1 wang acksd2rqg4 100K Feb 17 17:04 vcfheader.vcf
-rwx------ 1 wang acksd2rqg4 40K Feb 17 17:04 vidmap.json
[wang@21:15:05 chr7$1$159345973]$ls -al
total 33
drwx------ 4 wang acksd2rqg4 8192 Feb 17 17:05 .
drwx------ 3 wang acksd2rqg4 8192 Feb 17 17:05 ..
drwx------ 2 wang acksd2rqg4 8192 Feb 17 17:05 .__9972363a-2dc0-4b59-873d-1e60d362ef1247323956225792_1708160735373
-rwx------ 1 wang acksd2rqg4 592 Feb 17 17:05 __array_schema.tdb
-rwx------ 1 wang acksd2rqg4 0 Feb 17 17:05 .__consolidation_lock
drwx------ 2 wang acksd2rqg4 8192 Feb 17 17:05 genomicsdb_meta_dir
[wang@21:16:20 .__9972363a-2dc0-4b59-873d-1e60d362ef1247323956225792_1708160735373]$ll
total 111G
-rwx------ 1 wang acksd2rqg4 586M Feb 19 21:16 AD.tdb
-rwx------ 1 wang acksd2rqg4 262M Feb 19 21:16 AD_var.tdb
-rwx------ 1 wang acksd2rqg4 706M Feb 19 21:16 ALT.tdb
-rwx------ 1 wang acksd2rqg4 303M Feb 19 21:16 ALT_var.tdb
-rwx------ 1 wang acksd2rqg4 652M Feb 19 21:16 BaseQRankSum.tdb
-rwx------ 1 wang acksd2rqg4 15G Feb 19 21:16 __coords.tdb
-rwx------ 1 wang acksd2rqg4 9.7G Feb 19 21:16 DP_FORMAT.tdb
-rwx------ 1 wang acksd2rqg4 570M Feb 19 21:16 DP.tdb
-rwx------ 1 wang acksd2rqg4 17G Feb 19 21:16 END.tdb
-rwx------ 1 wang acksd2rqg4 490M Feb 19 21:16 ExcessHet.tdb
-rwx------ 1 wang acksd2rqg4 360M Feb 19 21:16 FILTER.tdb
-rwx------ 1 wang acksd2rqg4 11G Feb 19 21:16 GQ.tdb
-rwx------ 1 wang acksd2rqg4 614M Feb 19 21:16 GT.tdb
-rwx------ 1 wang acksd2rqg4 581M Feb 19 21:16 GT_var.tdb
-rwx------ 1 wang acksd2rqg4 360M Feb 19 21:16 ID.tdb
-rwx------ 1 wang acksd2rqg4 385M Feb 19 21:16 InbreedingCoeff.tdb
-rwx------ 1 wang acksd2rqg4 9.7G Feb 19 21:16 MIN_DP.tdb
-rwx------ 1 wang acksd2rqg4 584M Feb 19 21:16 MLEAC.tdb
-rwx------ 1 wang acksd2rqg4 113M Feb 19 21:16 MLEAC_var.tdb
-rwx------ 1 wang acksd2rqg4 584M Feb 19 21:16 MLEAF.tdb
-rwx------ 1 wang acksd2rqg4 129M Feb 19 21:16 MLEAF_var.tdb
-rwx------ 1 wang acksd2rqg4 501M Feb 19 21:16 MQRankSum.tdb
-rwx------ 1 wang acksd2rqg4 463M Feb 19 21:16 PGT.tdb
-rwx------ 1 wang acksd2rqg4 11M Feb 19 21:16 PGT_var.tdb
-rwx------ 1 wang acksd2rqg4 476M Feb 19 21:16 PID.tdb
-rwx------ 1 wang acksd2rqg4 21M Feb 19 21:16 PID_var.tdb
-rwx------ 1 wang acksd2rqg4 731M Feb 19 21:16 PL.tdb
-rwx------ 1 wang acksd2rqg4 34G Feb 19 21:16 PL_var.tdb
-rwx------ 1 wang acksd2rqg4 392M Feb 19 21:16 PS.tdb
-rwx------ 1 wang acksd2rqg4 779M Feb 19 21:16 QUAL.tdb
-rwx------ 1 wang acksd2rqg4 735M Feb 19 21:16 RAW_MQandDP.tdb
-rwx------ 1 wang acksd2rqg4 660M Feb 19 21:16 ReadPosRankSum.tdb
-rwx------ 1 wang acksd2rqg4 612M Feb 19 21:16 REF.tdb
-rwx------ 1 wang acksd2rqg4 3.6G Feb 19 21:16 REF_var.tdb
-rwx------ 1 wang acksd2rqg4 840M Feb 19 21:16 SB.tdb
[wang@21:17:10 genomicsdb_meta_dir]$ll
total 1.0K
-rwx------ 1 wang acksd2rqg4 62 Feb 17 17:05 genomicsdb_column_bounds.json
-rwx------ 1 wang acksd2rqg4 62 Feb 17 17:05 genomicsdb_meta_8a9b100c-753b-4793-a000-3918c7f4c3b3.json
What should the GenomicsDBImport output include? About how long should this process take?
-
the outdir chrom_merge_vcf_5:
[wang@21:32:28 chrom_merge_vcf_5]$tree -a
.
├── chr7$1$159345973
│ ├── .__9972363a-2dc0-4b59-873d-1e60d362ef1247323956225792_1708160735373
│ │ ├── AD.tdb
│ │ ├── AD_var.tdb
│ │ ├── ALT.tdb
│ │ ├── ALT_var.tdb
│ │ ├── BaseQRankSum.tdb
│ │ ├── __coords.tdb
│ │ ├── DP_FORMAT.tdb
│ │ ├── DP.tdb
│ │ ├── END.tdb
│ │ ├── ExcessHet.tdb
│ │ ├── FILTER.tdb
│ │ ├── GQ.tdb
│ │ ├── GT.tdb
│ │ ├── GT_var.tdb
│ │ ├── ID.tdb
│ │ ├── InbreedingCoeff.tdb
│ │ ├── MIN_DP.tdb
│ │ ├── MLEAC.tdb
│ │ ├── MLEAC_var.tdb
│ │ ├── MLEAF.tdb
│ │ ├── MLEAF_var.tdb
│ │ ├── MQRankSum.tdb
│ │ ├── PGT.tdb
│ │ ├── PGT_var.tdb
│ │ ├── PID.tdb
│ │ ├── PID_var.tdb
│ │ ├── PL.tdb
│ │ ├── PL_var.tdb
│ │ ├── PS.tdb
│ │ ├── QUAL.tdb
│ │ ├── RAW_MQandDP.tdb
│ │ ├── ReadPosRankSum.tdb
│ │ ├── REF.tdb
│ │ ├── REF_var.tdb
│ │ └── SB.tdb
│ ├── __array_schema.tdb
│ ├── .__consolidation_lock
│ └── genomicsdb_meta_dir
│ ├── genomicsdb_column_bounds.json
│ └── genomicsdb_meta_8a9b100c-753b-4793-a000-3918c7f4c3b3.json
├── __tiledb_workspace.tdb
├── vcfheader.vcf
└── vidmap.json
-
Hi again.
Both tools will suffer when trying to load all samples at once. GenomicsDBImport has a feature to load samples into the GenomicsDB workspace in batches, so the tool consumes less memory and runs faster.
--batch-size <Integer>
The above parameter is worth trying if you are already facing this issue.
By the way, you don't have to pass every single sample to this tool with its own --variant parameter. You can provide a sample map file using the parameter below.
--sample-name-map <String>
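For reference, the sample map is expected to be a plain tab-delimited text file, one sample per line, with the sample name and the GVCF path separated by a tab. For example (sampleA/sampleB and their paths are placeholders):
sampleA	/path/to/sampleA.g.vcf.gz
sampleB	/path/to/sampleB.g.vcf.gz
Both parameters can then be combined in one command, e.g. (workspace name and batch size are illustrative):
gatk GenomicsDBImport \
-R reference.fa \
--sample-name-map cohort.sample_map \
--genomicsdb-workspace-path my_workspace \
--batch-size 50 \
--intervals chr6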
I hope this helps.
-
Thanks for your reply!
1. Are the contents of the output folder shown above correct?
2. For "--sample-name-map map.list": the map.list content looks like this; is there a problem with this format?
CZEY001 CZEY001.HC.chr6.g.vcf.gz
CZEY002 CZEY002.HC.chr6.g.vcf.gz
CZEY003 CZEY003.HC.chr6.g.vcf.gz
CZEY004 CZEY004.HC.chr6.g.vcf.gz
CZEY005 CZEY005.HC.chr6.g.vcf.gz
CZEY006 CZEY006.HC.chr6.g.vcf.gz
CZEY007 CZEY007.HC.chr6.g.vcf.gz
CZEY008 CZEY008.HC.chr6.g.vcf.gz
CZEY009 CZEY009.HC.chr6.g.vcf.gz
-
My GenomicsDBImport command is shown below; is this format reasonable?
chrom=( chr6 chr7 chr8 chr9 chrX )
for i in ${chrom[@]};do
{
$gatk --java-options "-Xmx100g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
-R ${ref} \
--sample-name-map ${i}.list \
--genomicsdb-workspace-path chrom_merge_vcf_${i} \
--tmp-dir tmp \
--max-num-intervals-to-import-in-parallel 64 \
--genomicsdb-shared-posixfs-optimizations true \
--reader-threads 50 \
--batch-size 100 \
--intervals ${i}
}
done
-
Hi
The contents of the GenomicsDB folder look OK, and those commands also look fine. You may want to tweak the number of samples per batch if the problem persists. If even that does not solve your issue, we may need to check for additional problems with those contigs.
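As a side note, once an import finishes there is no combined GVCF to write at all; GenotypeGVCFs can read the workspace directly through a gendb:// URI. A minimal sketch reusing the workspace name from your command (the output file name here is illustrative):
gatk --java-options "-Xmx40g" GenotypeGVCFs \
-R hg38.fa \
-V gendb://chrom_merge_vcf_chr6 \
-O all.sample.HC.chr6.merge.vcf.gz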
I hope this helps.