GenotypeGVCFs generates empty vcf.gz files
Hello,
I run GenomicsDBImport and then GenotypeGVCFs. As output files, I got vcf.gz files which have only header, and do not contain any variants. However, the variants are present in the input and output files at HaplotypeCaller step.
Could you please help me with this?
All the .vcf.gz files output files have no variants and look like this.
bcftools view allSamples_joint_Chr1A.vcf.gz | tail
##contig=<ID=Chr7A,length=744491536>
##contig=<ID=Chr7B,length=764081788>
##contig=<ID=Chr7D,length=642921167>
##contig=<ID=ChrUnknown,length=351582993>
##source=GenomicsDBImport
##source=GenotypeGVCFs
##source=HaplotypeCaller
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view allSamples_joint_Chr1A.vcf.gz; Date=Tue Apr 2 19:11:27 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT line_1 line_2 line_3 line_4 … line_33
My codes are:
parallel -j ${CPU} "java -XX:ParallelGCThreads=1 \
-DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar ${prg}/gatk-package-4.5.0.0-local.jar HaplotypeCaller \
--reference ${ref_fasta} \
--input ${in_dir}/{}.sort.dedup.grp.pass2.realigned.BQSR.bam \
-ERC GVCF \
--output ${out_dir}/{}_BQSR_fromPass2_g.vcf.gz \
2>${log_dir}/{}.BQSR.g.vcf.log" ::: $(ls -1 *.sort.dedup.grp.pass2.realigned.BQSR.bam | sed 's/.sort.dedup.grp.pass2.realigned.BQSR.bam//')
parallel -j ${CPU} "java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar ${prg}/gatk-package-4.5.0.0-local.jar GenomicsDBImport \
--genomicsdb-workspace-path ${gDB}_{} \
--batch-size 32 \
-L {} \
--sample-name-map ${map} \
--tmp-dir ${tmp} \
--reader-threads 6 \
2>${log_dir}/{}.gDBImport.log" ::: Chr{1..7}A Chr{1..7}B Chr{1..7}D
parallel -j ${CPU} "java \"-Xmx16g\" -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar ${prg}/gatk-package-4.5.0.0-local.jar GenotypeGVCFs \
--reference ${ref_fasta} \
-V gendb://${gDB}_{} \
--output ${dir}/allSamples_joint_{}.vcf.gz \
2>${log_dir}/{}.joint_GenotypeGVCFs.log" ::: Chr{1..7}A Chr{1..7}B Chr{1..7}D
This is the content of the one (Ch1A) of the .gDBImport.log files.
13:13:47.637 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/Programs/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:13:48.115 INFO GenomicsDBImport - ------------------------------------------------------------
13:13:48.123 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.5.0.0
13:13:48.132 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
13:13:48.133 INFO GenomicsDBImport - Executing as user1@agriserver.ca
on Linux v5.4.0-150-generic amd64
13:13:48.133 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
13:13:48.134 INFO GenomicsDBImport - Start Date/Time: April 2, 2024 at 1:13:47 p.m. CST
13:13:48.134 INFO GenomicsDBImport - ------------------------------------------------------------
13:13:48.134 INFO GenomicsDBImport - ------------------------------------------------------------
13:13:48.137 INFO GenomicsDBImport - HTSJDK Version: 4.1.0
13:13:48.138 INFO GenomicsDBImport - Picard Version: 3.1.1
13:13:48.139 INFO GenomicsDBImport - Built for Spark Version: 3.5.0
13:13:48.139 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:13:48.140 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:13:48.140 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:13:48.142 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:13:48.142 INFO GenomicsDBImport - Deflater: IntelDeflater
13:13:48.143 INFO GenomicsDBImport - Inflater: IntelInflater
13:13:48.143 INFO GenomicsDBImport - GCS max retries/reopens: 20
13:13:48.143 INFO GenomicsDBImport - Requester pays: disabled
13:13:48.144 INFO GenomicsDBImport - Initializing engine
13:13:48.461 INFO IntervalArgumentCollection - Processing 598660471 bp from intervals
13:13:48.463 INFO GenomicsDBImport - Done initializing engine
13:13:57.917 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
13:13:57.924 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/514453/user1/d07_GATK4_joint/genomicDB4jointcalling_Chr1A/vidmap.json
13:13:57.925 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/514453/user1/d07_GATK4_joint/genomicDB4jointcalling_Chr1A/callset.json
13:13:57.926 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/514453/user1/d07_GATK4_joint/genomicDB4jointcalling_Chr1A/vcfheader.vcf
13:13:57.927 INFO GenomicsDBImport - Importing to workspace - /mnt/514453/user1/d07_GATK4_joint/genomicDB4jointcalling_Chr1A
13:13:58.222 INFO GenomicsDBImport - Starting batch input file preload
13:13:58.378 INFO GenomicsDBImport - Finished batch preload
13:13:58.381 INFO GenomicsDBImport - Importing batch 1 with 32 samples
13:13:59.096 INFO GenomicsDBImport - Done importing batch 1/2
13:13:59.097 INFO GenomicsDBImport - Starting batch input file preload
13:13:59.101 INFO GenomicsDBImport - Finished batch preload
13:13:59.101 INFO GenomicsDBImport - Importing batch 2 with 1 samples
13:13:59.129 INFO GenomicsDBImport - Done importing batch 2/2
13:13:59.138 INFO GenomicsDBImport - Import of all batches to GenomicsDB completed!
13:13:59.139 INFO GenomicsDBImport - Shutting down engine
[April 2, 2024 at 1:13:59 p.m. CST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.19 minutes.
Runtime.totalMemory()=285212672
The content of the .joint_GenotypeGVCFs.log files looks like this:
19:20:53.671 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/Programs/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
19:20:54.039 INFO GenotypeGVCFs - ------------------------------------------------------------
19:20:54.048 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
19:20:54.052 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
19:20:54.052 INFO GenotypeGVCFs - Executing as user1@agriserver.ca on Linux v5.4.0-150-generic amd64
19:20:54.053 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
19:20:54.053 INFO GenotypeGVCFs - Start Date/Time: April 2, 2024 at 7:20:53 p.m. CST
19:20:54.053 INFO GenotypeGVCFs - ------------------------------------------------------------
19:20:54.054 INFO GenotypeGVCFs - ------------------------------------------------------------
19:20:54.056 INFO GenotypeGVCFs - HTSJDK Version: 4.1.0
19:20:54.061 INFO GenotypeGVCFs - Picard Version: 3.1.1
19:20:54.061 INFO GenotypeGVCFs - Built for Spark Version: 3.5.0
19:20:54.062 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:20:54.062 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:20:54.063 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:20:54.064 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:20:54.069 INFO GenotypeGVCFs - Deflater: IntelDeflater
19:20:54.071 INFO GenotypeGVCFs - Inflater: IntelInflater
19:20:54.072 INFO GenotypeGVCFs - GCS max retries/reopens: 20
19:20:54.072 INFO GenotypeGVCFs - Requester pays: disabled
19:20:54.073 INFO GenotypeGVCFs - Initializing engine
19:20:54.794 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
19:20:54.848 INFO NativeGenomicsDB - pid=74047 tid=74052 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
19:20:54.848 INFO NativeGenomicsDB - pid=74047 tid=74052 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
19:20:54.848 INFO NativeGenomicsDB - pid=74047 tid=74052 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
19:20:54.885 INFO GenotypeGVCFs - Done initializing engine
19:20:54.930 INFO ProgressMeter - Starting traversal
19:20:54.931 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0,Cpu time(s),0.0
19:20:55.197 INFO ProgressMeter - unmapped 0.0 0 0.0
19:20:55.198 INFO ProgressMeter - Traversal complete. Processed 0 total variants in 0.0 minutes.
19:20:55.203 INFO GenotypeGVCFs - Shutting down engine
[April 2, 2024 at 7:20:55 p.m. CST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=167772160
-
Hi E Ra
How were your bam files created? Which aligner did you use?
-
Hi Gokalp,
I used BWA aligner. My code was:
parallel -j $(CPU} bwa mem -t 32 "${ref_fasta}" {}_R1.fq.gz {}_R2.fq.gz "|" samtools view -b -S -h ">" {}.orig.bam ::: $(ls -1 *_R1.fq.gz | sed 's/_R1.fq.gz//')
Thank you,
-
Hi E Ra
Looks like your imports to GenomicsDB is happening within a second which is highly unlikely to be true. Most certainly there is an issue with your GVCF files or calls. Can you share a couple variants by using
bcftools view -H <gvcffile> | head -n 4
or
zcat <gvcffile> | grep -v @ | head -n 4
-
Hi Gökalp,
Below is the excerpt from a gvcfile
bcftools view -H line1_BQSR_fromPass2_g.vcf.gz | head -n 4
Chr1A 1 . C <NON_REF> . . END=34 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,25
Chr1A 35 . C <NON_REF> . . END=81 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,65
Chr1A 82 . C <NON_REF> . . END=104 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,98
Chr1A 105 . C <NON_REF> . . END=181 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,65 -
You seem to have all reference intervals. Do you have any variant sites present. Can you provide more lines for that?
-
Hi Gökalp,
Here are two variant sites.
Chr1A 15167 . G <NON_REF> . . END=15224 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,32
Chr1A 15225 . C <NON_REF> . . END=15292 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,49
Chr1A 15293 . C A,<NON_REF> 37.31 . DP=2;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;RAW_MQandDP=1458,2 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1
Chr1A 15294 . A <NON_REF> . . END=15314 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,49
Chr1A 15315 . T <NON_REF> . . END=15315 GT:DP:GQ:MIN_DP:PL 0/0:2:3:2:0,3,45
Chr1A 15316 . C <NON_REF> . . END=15372 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,64
Chr1A 15373 . G <NON_REF> . . END=15382 GT:DP:GQ:MIN_DP:PL 0/0:2:3:2:0,3,45
Chr1A 15383 . G <NON_REF> . . END=15461 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,61
Chr1A 15462 . A <NON_REF> . . END=15530 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,12
Chr1A 15531 . T <NON_REF> . . END=15707 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 15708 . C <NON_REF> . . END=15851 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,12
Chr1A 15852 . T <NON_REF> . . END=25438 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 25439 . C <NON_REF> . . END=25553 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,15
Chr1A 25554 . A <NON_REF> . . END=25555 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,49
Chr1A 25556 . G <NON_REF> . . END=25557 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 25558 . T <NON_REF> . . END=25568 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,36
Chr1A 25569 . G <NON_REF> . . END=25569 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 25570 . C <NON_REF> . . END=25579 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,36
Chr1A 25580 . C <NON_REF> . . END=25583 GT:DP:GQ:MIN_DP:PL 0/0:2:3:2:0,3,45
Chr1A 25584 . C T,<NON_REF> 15.45 . DP=1;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;RAW_MQandDP=1600,1 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:25,3,0,25,3,25:0,0,1,0
Chr1A 25585 . C <NON_REF> . . END=25590 GT:DP:GQ:MIN_DP:PL 0/0:2:3:2:0,3,45
Chr1A 25591 . G <NON_REF> . . END=25591 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 25592 . A <NON_REF> . . END=25596 GT:DP:GQ:MIN_DP:PL 0/0:2:3:1:0,3,31 -
Hi again.
Looks like your import process is somewhat incomplete.
Can you try importing one set of samples without using any parallel code that may interfere with the gatk operation?
-
Hi Gökalp,
I am not sure I understand what you meant by importing samples. I tried the codes below and I still do have no variants in the resulting vcf.gz file.
My codes were:
java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar ${prg}/gatk-package-4.5.0.0-local.jar GenomicsDBImport \
--genomicsdb-workspace-path ${gDB}_Chr1A \
--batch-size 32 \
-L Chr1A \
--sample-name-map ${map} \
--tmp-dir ${tmp} \
--reader-threads 6 \
2>${log_dir}/Chr1A.gDBImport.log
java -Xmx16g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar ${prg}/gatk-package-4.5.0.0-local.jar GenotypeGVCFs \
--reference ${ref_fasta} \
-V gendb://${gDB}_Chr1A \
--output ${dir}/allSamples_joint_Chr1A.vcf.gz \
2>${log_dir}/Chr1A.joint_GenotypeGVCFs.logThe log file is below, it seems not to display any error message.
11:24:12.380 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/Programs/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:24:12.536 INFO GenotypeGVCFs - ------------------------------------------------------------
11:24:12.539 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
11:24:12.539 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
11:24:12.539 INFO GenotypeGVCFs - Executing as user1@agriserver.ca
on Linux v5.4.0-150-generic amd64
11:24:12.539 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
11:24:12.540 INFO GenotypeGVCFs - Start Date/Time: April 8, 2024 at 11:24:12 a.m. CST
11:24:12.540 INFO GenotypeGVCFs - ------------------------------------------------------------
11:24:12.540 INFO GenotypeGVCFs - ------------------------------------------------------------
11:24:12.541 INFO GenotypeGVCFs - HTSJDK Version: 4.1.0
11:24:12.541 INFO GenotypeGVCFs - Picard Version: 3.1.1
11:24:12.541 INFO GenotypeGVCFs - Built for Spark Version: 3.5.0
11:24:12.541 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:24:12.542 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:24:12.542 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:24:12.542 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:24:12.542 INFO GenotypeGVCFs - Deflater: IntelDeflater
11:24:12.542 INFO GenotypeGVCFs - Inflater: IntelInflater
11:24:12.543 INFO GenotypeGVCFs - GCS max retries/reopens: 20
11:24:12.543 INFO GenotypeGVCFs - Requester pays: disabled
11:24:12.543 INFO GenotypeGVCFs - Initializing engine
11:24:12.849 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
11:24:12.919 INFO NativeGenomicsDB - pid=52042 tid=52043 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
11:24:12.919 INFO NativeGenomicsDB - pid=52042 tid=52043 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
11:24:12.919 INFO NativeGenomicsDB - pid=52042 tid=52043 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
11:24:12.951 INFO GenotypeGVCFs - Done initializing engine
11:24:13.000 INFO ProgressMeter - Starting traversal
11:24:13.001 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0,Cpu time(s),0.0
11:24:13.204 INFO ProgressMeter - unmapped 0.0 0 0.0
11:24:13.205 INFO ProgressMeter - Traversal complete. Processed 0 total variants in 0.0 minutes.
11:24:13.209 INFO GenotypeGVCFs - Shutting down engine
[April 8, 2024 at 11:24:13 a.m. CST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1157627904 -
Hi E Ra
I meant the GenomicsDBImport step. Do you get any error messages during that step?
When I looked at your previous logs this is what got my attention.
13:13:58.222 INFO GenomicsDBImport - Starting batch input file preload
13:13:58.378 INFO GenomicsDBImport - Finished batch preload
13:13:58.381 INFO GenomicsDBImport - Importing batch 1 with 32 samples
13:13:59.096 INFO GenomicsDBImport - Done importing batch 1/2Import of 32 samples takes only about less than a second to complete which is quite impossible considering you have many samples in there. It is possible that your samples need a CSI style VCF index and therefore import function does not work properly.
Unfortunately we do not currently support reading CSI style VCF index files. There is a workaround that you may want to use which is using glnexus to jointly genotype your GVCF files. We cannot provide any support for glnexus therefore your mileage may vary.
Regards.
Please sign in to leave a comment.
9 comments