QualByDepth documentation issue
AnsweredHi, QualByDepth documentation mentions the QUAL value is divided by Allele Depth (AD) whereas it's actually divided by Depth (DP). AD sounds misleading because there is an FORMAT field AD which refers to alternate allele depth. I can see how in a multi-sample context DP might sound misleading but the documentation is clear that depth of samples with hom-ref call is not counted.
https://gatk.broadinstitute.org/hc/en-us/articles/360036856351-QualByDepth
-
Hello Uday Evani, in the code GATK calculates QualByDepth using AD. If there is no AD, it is calculated with DP as a fallback.
If you are seeing QualByDepth calculated from DP instead of AD when they are both present, it may be a bug. Is this what you are referring to?
-
Yes, i think this might be a bug. I agree AD is more appropriate than using DP. I spot checked a few variants in a VCF that was generated using GATK3.5 and in most cases the QUAL is being divided by DP. In fact, the multi-sample example that is provided in the documentation is also dividing by DP. There were few cases where it is unclear what is the denominator, see example 3 below. I am happy to upload a small example VCF if its helpful.
Example 1:
chr21 5134832 . G A 41.17 VQSRTrancheSNP99.80to100.00 AC=1;AF=0.167;AN=6;BaseQRankSum=-6.840e-01;ClippingRankSum=-3.220e-01;DP=14;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=27.00;MQ0=0;MQRankSum=-1.517e+00;QD=5.15;ReadPosRankSum=0.684;SOR=0.368;VQSLOD=-1.754e+01;culprit=MQ
GT:AB:AD:DP:GQ:PL 0/1:0.630:5,3:8:71:71,0,135 0/0:.:2,0:2:6:0,6,70 0/0:.:4,0:4:12:0,12,119Example 2:
chr21 5216481 . T A 1410.92 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=0.146;ClippingRankSum=0.051;DP=108;ExcessHet=6.9897;FS=0.000;MLEAC=3;MLEAF=0.500;MQ=52.46;MQ0=0;MQRankSum=0.556;NEGATIVE_TRAIN_SITE;QD=14.70;ReadPosRankSum=-4.880e-01;SOR=0.632;VQSLOD=0.191;culprit=MQ GT:AB:AD:DP:GQ:PL 0/1:0.680:17,8:25:99:203,0,516 0/1:0.400:14,21:35:99:616,0,381 0/1:0.420:15,21:36:99:622,0,413
Example 3:
chr21 5067051 . GAAA GA,G 1714.77 PASS AC=2,2;AF=0.500,0.500;AN=4;DP=44;ExcessHet=3.0103;FS=0.000;MLEAC=2,2;MLEAF=0.500,0.500;MQ=32.64;MQ0=0;QD=27.64;SOR=0.741;VQSLOD=2.50;culprit=DP GT:AD:DP:GQ:PL 2/2:0,0,30:30:90:1350,1350,1350,90,90,0 ./.:0,0,0:0 1/1:0,11,0:11:33:394,33,0,394,33,394
-
Uday Evani, we do not support GATK3 anymore. Does this also occur in more recent versions of GATK4?
-
Looks like it is happening in the more recent version as well. Here are a couple of variants from GATK version 4.1.8.0.
chr1 13613 . T A 49.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.328;CNN_1D=-4.137;DP=7;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=29.36;MQRankSum=-0.744;QD=7.09;ReadPosRankSum=0.328;SOR=0.446 GT:AD:DP:GQ:PL 0/1:4,3:7:57:57,0,85
chr1 14653 . C T 267.25 PASS AC=2;AF=1;AN=2;BaseQRankSum=-0.381;CNN_1D=-0.422;DP=13;ExcessHet=3.0103;FS=6.021;MLEAC=1;MLEAF=0.5;MQ=32.51;MQRankSum=1.06;QD=22.27;ReadPosRankSum=1.01;SOR=2.43 GT:AD:DP:GQ:PL 1/1:1,11:12:8:280,8,0
-
Uday Evani I will bring this up with my team and get back to you when I have more information.
-
Uday Evani It doesn't look like the variants you provided contradicted the documentation.
1. QUAL = 29.64. AD = 4,3. QD = 29.64 / 7 = 7.09
2. QUAL = 267.25. AD = 1,11 QD = 267.25 / 12 = 22.27
The calculations above are the same as the documentation for QD. Please let me know if there are any other issues present.
-
In the first case shouldn't the QUAL have been divided by alternate allele depth which is 3 and not depth (DP) which is 7? Similarly in the second case the alternate allele depth is 11 whereas it being divided by total depth (DP) 12.
-
Uday Evani the documentation for QD describes how this calculation works. Please see the examples and the description there because the way you are describing the calculation is not how QD is calculated. The AD value used is the sum of the depths of each allele.
We have other annotations that may be useful to you, possibly AS_QualByDepth. You can view all the GATK annotations available here.
In many cases, AD and DP can match, however, they are calculated with different information. Please see this document on AD being different than DP and more information in our VCF overview.
-
Thank you for pointing to the link describing the difference between AD and DP calculation. That was helpful. I am still a bit unclear especially because this article, https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants suggests that hom-var samples contribute twice as many reads supporting the variant than do het variants. If we assume diploid then this would only happen if we are looking at the reads supporting the alternate allele and not all the including the reference. Maybe i just need to go over the documentation once again. Thank you for all your help.
-
Uday Evani Glad that helped out! The GATK support efforts on the forum are focused on errors and abnormal results and so I will make a note about the hard filtering document being confusing for our team that works on documentation.
For further clarification and explanation we are building a backlog to work through when we have the capacity. We also encourage other users to help out when they know the answer.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools. For context, check out our support policy.
-
Hi,
To follow up on that discussion, I also found some strange results in the QD field. I made two tests ; first running haplotypeCaller on a single complete bam file (WES), or running it with `-L` option, in parallele, per chromosome.
When comparing the results, I found 2 (over >40000 variants) which are found only in the 'per-chromosome' mode. Going further, I found that around 1/4 of my variants only differ in QD field (which may explain that these 2 variants were filtered in one scenario and not the other)
Here is an exemple ;
chrX 153174819 rs11394527 T TG 3086.03 . AC=2;AF=1;AN=2;DB;DP=92;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=33.55;SOR=3.391 GT:AD:DP:GQ:PL 1/1:0,87:87:99:3100,262,0
chrX 153174819 rs11394527 T TG 3086.03 . AC=2;AF=1;AN=2;DB;DP=92;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=28.19;SOR=3.391 GT:AD:DP:GQ:PL 1/1:0,87:87:99:3100,262,0
And actually, none of the QD values correspond to the calculation you explained before.
I would expect in both cases ; QD = 3086.03 / 87 = 35.47
Would you have any explanation for that ? many thanks
I'm using bioconda::gatk4=4.1.6.0=py38_0
-
Hi Nicolas Servant,
Different methods can have different results. When you are running in per-chromosome mode, some reads are restricted from the analysis if they are not included in the interval. When GATK's HaplotypeCaller algorithm runs on these reads (The four steps of the algorithm are summarized here), there can be slight differences if the input reads are different.
Could you clarify what you mean here?
Going further, I found that around 1/4 of my variants only differ in QD field (which may explain that these 2 variants were filtered in one scenario and not the other)
I'm not sure what you mean by only differing in the QD field.
Best,
Genevieve
-
Hi Genevieve Brandt (she/her) !
When I compared the results of the GATK calls in per-chromosome mode, versus in whole-exome mode, I only have minor differences in the total number of detected variants. However, I realized that for almost 25% of the variants called by the two methods, the QD value is different in the VCF file. All others VCF fields are exactly the same, except the QD (as shown in my previous exemple).
The other point is that for these variants, the QD value does not match the expected calculation, (ie. QD=QUAL/DP as explained above on the thread), and so I was wondering how it was calculated ? and why the QD value is different between the two modes.
Many thanks for your help. All the best, Nicolas
-
I see, I'm not sure why these would have different QD values. The QD calculation should be what is described above and in this document. Is there one method that matches this calculation and the other does not?
How many samples do you have?
-
I only have one sample in the vcf. And actually, that's my point, on the example I put below, none of the QD calculation seems to work in this case ...
Best
-
Hi Nicolas Servant,
Thanks for these details. I checked in with the developer team and found more information about the QD calculation that is not included in our documentation.
Variants with a QD over 35 have an extra correction applied because these QD values can be outliers and cause issues with variant filtering. The algorithm gives a corrected randomized QD value using a Gaussian model around 30.
We are trying to correct for large indels and phased variants which give incredibly high QD values. The other reason you might see this is because AD only uses informative reads, and if there are few informative reads, the QD value could become very high (since QD is QUAL/AD). You can read more about informative reads and AD here: Allele Depth (AD) is lower than expected.
I am going to submit a documentation request for our documentation team to include this information in the annotation description. Thanks for bringing this up!
Best,
Genevieve
-
Thank you Genevieve Brandt (she/her) for the answer.
When you say that " The algorithm gives a corrected randomized QD value using a Gaussian model around 30. ", it would be great to have a way to fix a seed or something to ensure reproducibility of the calculation. Thanks again for you help !
-
Thanks for the feedback Nicolas Servant, I'll add this to our feature request backlog!
If other users would benefit from this as well, please comment on this post.
Please sign in to leave a comment.
18 comments