GATK4 CombineGVCFs gives wrong allelic depth
Hi GATK support,
I am trying to use gatk4 combineGVCFs to combine two individual gvcf files for joint calling and found that CombineGVCFs gives wrong allelic depth. Here is the example:
For locus 2:182816563 it is homozygous reference with reference allelic depth=57 and alternative non reference allelic depth=2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
2 182816563 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:57,2:59:99:0,120,1800
For the same locus in another sample, there is an indel present.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample2
2 182816563 . TCTTTA T,<NON_REF> 2533.73 . AS_RAW_BaseQRankSum=|-1.1,1|NaN;AS_RAW_MQ=190800.00|234000.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|3.1,1|NaN;AS_SB_TABLE=36,17|41,24|0,0;BaseQRankSum=-1.030;DP=121;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=435600.00;ReadPosRankSum=3.176 GT:AD:GQ:PL:SB 0/1:53,65,0:99:2571,0,3497,2730,3693,6423:36,17,41,24
When I used combineGVCF to combine this locus from two samples, I found that the allelic depth for Sample 1 was wrongly assigned as 57,2,2. The reasons why I think it is wrong:
1. This allelic depth does not add to the overall allelic depth 59.
2. There is no deletion at this locus for sample 2. But the allelic depth shows that for Alternative allele T, the allelic depth is 2.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample2 Sample1
2 182816563 . TCTTTA T,<NON_REF> . . AS_RAW_BaseQRankSum=|-1.1,1|;AS_RAW_MQ=190800.00|234000.00|0.00;AS_RAW_MQRankSum=|0.0,1|;AS_RAW_ReadPosRankSum=|3.1,1|;AS_SB_TABLE=36,17|41,24|0,0;BaseQRankSum=-1.030e+00;DP=180;ExcessHet=3.01;MQRankSum=0.00;RAW_MQ=435600.00;ReadPosRankSum=3.18 GT:AD:DP:GQ:PL:SB ./.:53,65,0:.:99:2571,0,3497,2730,3693,6423:36,17,41,24./.:57,2,2:59:99:0,120,1800,120,1800,1800
I would really appreciate your help in this issue. Thank you!
GATK version used: gatk4-4.3.0.0-0
Exact command used: gatk CombineGVCFs -R $miller/ref/hs37d5/hs37d5.fa -V 1278_heart_bulk.g.vcf.gz -V 1278BA9-A.g.vcf.gz -O combine2.vcf.gz -G StandardAnnotation -G AS_StandardAnnotation
Entire program log:
Using GATK jar /home/boj924/miniconda3/envs/scan2/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /home/boj924/miniconda3/envs/scan2/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar CombineGVCFs -R /n/data1/bwh/pathology/miller/lab/ref/hs37d5/hs37d5.fa -V 1278_heart_bulk.g.vcf.gz -V 1278BA9-A.g.vcf.gz -O combine2.vcf.gz -G StandardAnnotation -G AS_StandardAnnotation
14:19:48.385 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
14:19:48.496 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/boj924/miniconda3/envs/scan2/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:19:48.807 INFO CombineGVCFs - ------------------------------------------------------------
14:19:48.807 INFO CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.3.0.0
14:19:48.807 INFO CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
14:19:48.808 INFO CombineGVCFs - Executing as boj924@compute-a-16-171.o2.rc.hms.harvard.edu on Linux v3.10.0-1160.102.1.el7.x86_64 amd64
14:19:48.808 INFO CombineGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_332-b09
14:19:48.808 INFO CombineGVCFs - Start Date/Time: January 11, 2024 2:19:48 PM EST
14:19:48.808 INFO CombineGVCFs - ------------------------------------------------------------
14:19:48.808 INFO CombineGVCFs - ------------------------------------------------------------
14:19:48.809 INFO CombineGVCFs - HTSJDK Version: 3.0.1
14:19:48.809 INFO CombineGVCFs - Picard Version: 2.27.5
14:19:48.809 INFO CombineGVCFs - Built for Spark Version: 2.4.5
14:19:48.809 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:19:48.809 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:19:48.809 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:19:48.810 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:19:48.810 INFO CombineGVCFs - Deflater: IntelDeflater
14:19:48.810 INFO CombineGVCFs - Inflater: IntelInflater
14:19:48.810 INFO CombineGVCFs - GCS max retries/reopens: 20
14:19:48.810 INFO CombineGVCFs - Requester pays: disabled
14:19:48.810 INFO CombineGVCFs - Initializing engine
14:19:49.600 INFO FeatureManager - Using codec VCFCodec to read file file:///home/boj924/AD_Tau_PTA/1278_heart_bulk.g.vcf.gz
14:19:49.630 WARN IntelInflater - Zero Bytes Written : 0
14:19:49.652 INFO FeatureManager - Using codec VCFCodec to read file file:///home/boj924/AD_Tau_PTA/1278BA9-A.g.vcf.gz
14:19:49.663 WARN IntelInflater - Zero Bytes Written : 0
14:19:49.686 WARN IntelInflater - Zero Bytes Written : 0
14:19:49.704 WARN IntelInflater - Zero Bytes Written : 0
14:19:49.900 INFO CombineGVCFs - Done initializing engine
14:19:49.939 INFO ProgressMeter - Starting traversal
14:19:49.939 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
14:19:49.942 WARN IntelInflater - Zero Bytes Written : 0
14:19:49.973 WARN IntelInflater - Zero Bytes Written : 0
14:19:50.068 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location 2:182816563 the annotation MLEAC=[1, 0] was not a numerical value and was ignored
14:19:50.099 INFO ProgressMeter - unmapped 0.0 22 8250.0
14:19:50.099 INFO ProgressMeter - Traversal complete. Processed 22 total variants in 0.0 minutes.
14:19:50.118 INFO CombineGVCFs - Shutting down engine
[January 11, 2024 2:19:50 PM EST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=324534272
-
Hi Bowen Jin,
You'll note that in the GVCF the 2 in the AD field refers to the non-reference allele. The behavior of the tool is as expected: it assigns the non-ref AD to every allele it sees that is not the reference. Maybe not desirable, but expected. If this is not what you want to happen, you can run the ReblockGVCF tool, which will zero out non-ref AD values and prevent them from being propagated when more alleles are discovered in joint calling.
Please sign in to leave a comment.
1 comment