Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

QualByDepth documentation issue

Answered
0

18 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hello Uday Evani, in the code GATK calculates QualByDepth using AD. If there is no AD, it is calculated with DP as a fallback. 

    If you are seeing QualByDepth calculated from DP instead of AD when they are both present, it may be a bug. Is this what you are referring to?

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Yes, i think this might be a bug. I agree AD is more appropriate than using DP. I spot checked a few variants in a VCF that was generated using GATK3.5 and in most cases the QUAL is being divided by DP. In fact, the multi-sample example that is provided in the documentation is also dividing by DP. There were few cases where it is unclear what is the denominator, see example 3 below. I am happy to upload a small example VCF if its helpful.

    Example 1:

    chr21 5134832 . G A 41.17 VQSRTrancheSNP99.80to100.00 AC=1;AF=0.167;AN=6;BaseQRankSum=-6.840e-01;ClippingRankSum=-3.220e-01;DP=14;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=27.00;MQ0=0;MQRankSum=-1.517e+00;QD=5.15;ReadPosRankSum=0.684;SOR=0.368;VQSLOD=-1.754e+01;culprit=MQ
    GT:AB:AD:DP:GQ:PL 0/1:0.630:5,3:8:71:71,0,135 0/0:.:2,0:2:6:0,6,70 0/0:.:4,0:4:12:0,12,119

    Example 2:

    chr21 5216481 . T A 1410.92 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=0.146;ClippingRankSum=0.051;DP=108;ExcessHet=6.9897;FS=0.000;MLEAC=3;MLEAF=0.500;MQ=52.46;MQ0=0;MQRankSum=0.556;NEGATIVE_TRAIN_SITE;QD=14.70;ReadPosRankSum=-4.880e-01;SOR=0.632;VQSLOD=0.191;culprit=MQ GT:AB:AD:DP:GQ:PL 0/1:0.680:17,8:25:99:203,0,516 0/1:0.400:14,21:35:99:616,0,381 0/1:0.420:15,21:36:99:622,0,413

    Example 3:

    chr21 5067051 . GAAA GA,G 1714.77 PASS AC=2,2;AF=0.500,0.500;AN=4;DP=44;ExcessHet=3.0103;FS=0.000;MLEAC=2,2;MLEAF=0.500,0.500;MQ=32.64;MQ0=0;QD=27.64;SOR=0.741;VQSLOD=2.50;culprit=DP GT:AD:DP:GQ:PL 2/2:0,0,30:30:90:1350,1350,1350,90,90,0 ./.:0,0,0:0 1/1:0,11,0:11:33:394,33,0,394,33,394

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Uday Evani, we do not support GATK3 anymore. Does this also occur in more recent versions of GATK4?

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Looks like it is happening in the more recent version as well. Here are a couple of variants from GATK version 4.1.8.0.

    chr1 13613 . T A 49.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.328;CNN_1D=-4.137;DP=7;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=29.36;MQRankSum=-0.744;QD=7.09;ReadPosRankSum=0.328;SOR=0.446 GT:AD:DP:GQ:PL 0/1:4,3:7:57:57,0,85

     

    chr1 14653 . C T 267.25 PASS AC=2;AF=1;AN=2;BaseQRankSum=-0.381;CNN_1D=-0.422;DP=13;ExcessHet=3.0103;FS=6.021;MLEAC=1;MLEAF=0.5;MQ=32.51;MQRankSum=1.06;QD=22.27;ReadPosRankSum=1.01;SOR=2.43 GT:AD:DP:GQ:PL 1/1:1,11:12:8:280,8,0

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Uday Evani I will bring this up with my team and get back to you when I have more information.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Uday Evani It doesn't look like the variants you provided contradicted the documentation.

    1. QUAL = 29.64. AD = 4,3. QD = 29.64 / 7 = 7.09

    2. QUAL = 267.25. AD = 1,11 QD = 267.25 / 12 = 22.27

    The calculations above are the same as the documentation for QD. Please let me know if there are any other issues present. 

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    In the first case shouldn't the QUAL have been divided by alternate allele depth which is 3 and not depth (DP) which is 7? Similarly in the second case the alternate allele depth is 11 whereas it being divided by total depth (DP) 12.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Uday Evani the documentation for QD describes how this calculation works. Please see the examples and the description there because the way you are describing the calculation is not how QD is calculated. The AD value used is the sum of the depths of each allele. 

    We have other annotations that may be useful to you, possibly AS_QualByDepth. You can view all the GATK annotations available here

    In many cases, AD and DP can match, however, they are calculated with different information. Please see this document on AD being different than DP and more information in our VCF overview

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Thank you for pointing to the link describing the difference between AD and DP calculation. That was helpful. I am still a bit unclear especially because this article, https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants suggests that hom-var samples contribute twice as many reads supporting the variant than do het variants. If we assume diploid then this would only happen if we are looking at the reads supporting the alternate allele and not all the including the reference. Maybe i just need to go over the documentation once again. Thank you for all your help. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Uday Evani Glad that helped out! The GATK support efforts on the forum are focused on errors and abnormal results and so I will make a note about the hard filtering document being confusing for our team that works on documentation.

    For further clarification and explanation we are building a backlog to work through when we have the capacity. We also encourage other users to help out when they know the answer. 

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools. For context, check out our support policy.

    0
    Comment actions Permalink
  • Avatar
    Nicolas Servant

    Hi,

    To follow up on that discussion, I also found some strange results in the QD field. I made two tests ; first running haplotypeCaller on a single complete bam file (WES), or running it with `-L` option, in parallele, per chromosome.

    When comparing the results, I found 2 (over >40000 variants) which are found only in the 'per-chromosome' mode. Going further, I found that around 1/4 of my variants only differ in QD field (which may explain that these 2 variants were filtered in one scenario and not the other)

    Here is an exemple ;

    chrX 153174819 rs11394527 T TG 3086.03 . AC=2;AF=1;AN=2;DB;DP=92;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=33.55;SOR=3.391 GT:AD:DP:GQ:PL 1/1:0,87:87:99:3100,262,0

    chrX 153174819 rs11394527 T TG 3086.03 . AC=2;AF=1;AN=2;DB;DP=92;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=28.19;SOR=3.391 GT:AD:DP:GQ:PL 1/1:0,87:87:99:3100,262,0

    And actually, none of the QD values correspond to the calculation you explained before.

    I would expect in both cases ; QD = 3086.03 / 87 = 35.47

    Would you have any explanation for that ? many thanks

    I'm using bioconda::gatk4=4.1.6.0=py38_0

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Nicolas Servant,

    Different methods can have different results. When you are running in per-chromosome mode, some reads are restricted from the analysis if they are not included in the interval. When GATK's HaplotypeCaller algorithm runs on these reads (The four steps of the algorithm are summarized here), there can be slight differences if the input reads are different. 

    Could you clarify what you mean here?

    Going further, I found that around 1/4 of my variants only differ in QD field (which may explain that these 2 variants were filtered in one scenario and not the other)

    I'm not sure what you mean by only differing in the QD field.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Nicolas Servant

    Hi Genevieve Brandt (she/her) !

    When I compared the results of the GATK calls in per-chromosome mode, versus in whole-exome mode, I only have minor differences in the total number of detected variants. However, I realized that for almost 25% of the variants called by the two methods, the QD value is different in the VCF file. All others VCF fields are exactly the same, except the QD (as shown in my previous exemple).

    The other point is that for these variants, the QD value does not match the expected calculation, (ie. QD=QUAL/DP as explained above on the thread), and so I was wondering how  it was calculated ? and why  the QD value is different between the two modes.

    Many thanks for your help. All the best, Nicolas

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    I see, I'm not sure why these would have different QD values. The QD calculation should be what is described above and in this document. Is there one method that matches this calculation and the other does not?

    How many samples do you have?

    0
    Comment actions Permalink
  • Avatar
    Nicolas Servant

    I only have one sample in the vcf. And actually, that's my point, on the example I put below, none of the QD calculation seems to work in this case ...

    Best

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Nicolas Servant,

    Thanks for these details. I checked in with the developer team and found more information about the QD calculation that is not included in our documentation.

    Variants with a QD over 35 have an extra correction applied because these QD values can be outliers and cause issues with variant filtering. The algorithm gives a corrected randomized QD value using a Gaussian model around 30. 

    We are trying to correct for large indels and phased variants which give incredibly high QD values. The other reason you might see this is because AD only uses informative reads, and if there are few informative reads, the QD value could become very high (since QD is QUAL/AD). You can read more about informative reads and AD here: Allele Depth (AD) is lower than expected.

    I am going to submit a documentation request for our documentation team to include this information in the annotation description. Thanks for bringing this up!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Nicolas Servant

    Thank you Genevieve Brandt (she/her) for the answer.

    When you say that " The algorithm gives a corrected randomized QD value using a Gaussian model around 30. ", it would be great to have a way to fix a seed or something to ensure reproducibility of the calculation. Thanks again for you help !

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the feedback Nicolas Servant, I'll add this to our feature request backlog! 

    If other users would benefit from this as well, please comment on this post.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk