Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

The depth of SNV/InDels calculated by Mutect2 in GATK4.2.0.0 is much lower than real sequencing depth, why?

Answered
0

19 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Ce Wang,

    We fixed the option --dont-use-soft-clipped-bases in GATK 4.2.0.0 to actually work as intended, there was a bug with that option in GATK 4.1.9.0. 

    The AD in 4.2.0.0 should be correct after taking into account soft clipped bases. When you say that 4.1.9.0 is concordant with the "true depth", which true depth is this? You can look on IGV to see what is going on here, if there are soft clipped bases at this location, then that explains the difference.

    Let me know if this is not occurring.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Hi Genevieve, 

    Thanks to your advice. But I think soft clipped base is not the case, because the ratio of soft clipped reads in my data is very low. The picture below shows the feature of the site I mentioned before in IGV.  

    The depth of this site is 944 in IGV, which is concordant with the AD in 4.1.9.0:

    So I think there exists other reason to explain the difference. 

    Best, 

    Ce Wang

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ce Wang I see, thanks for sharing those. Could you compare the IGV depth from the -bamout using these two GATK versions? Are there differences in the realigned reads?

    Here is a helpful article that describes what to look for when troubleshooting certain sites: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    The two bam files generated by --bam-output have the difference in depth. Each one is concordant with corresponding AD. The two pictures below show the same location in 4.1.9.0-bamout and 4.2.0.0-bamout respectively. 

    4.1.9.0: 

    4.2.0.0: 

    All of the target regions have the difference in depth like this. So the size of bam file output by 4.2.0.0 is much smaller than that of 4.1.9.0.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ce Wang Could you zoom out and compare these two bams to see what the reads look like at this location? Mutect2 realigns the reads so there is a possibility that after realignment there were more soft clipped bases at this location.

    Here is a description of the local re-assembly and haplotype determination: https://gatk.broadinstitute.org/hc/en-us/articles/360036227612-Local-re-assembly-and-haplotype-determination-HaplotypeCaller-and-Mutect2-

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Genevieve, there is no soft clipped read at this location in both bams. 

    4.1.9.0:

    4.2.0.0:

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    I see, I want to look into whether this difference is from the --dont-use-soft-clipped-bases argument or the GATK assembly engine. 

    Could you run this same comparison but without the --dont-use-soft-clipped-bases argument and post the results?

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Genevieve, I run without --dont-use-soft-clipped-bases as you said, but got almost the same result as before. The difference still exists. 

    4.2.0.0: 

    1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=54,53|47,23;AS_UNIQ_ALT_READ_COUNT=70;BaseQRankSum=1.299;ClippingRankSum=-1.262;DP=186;ECNT=2;FS=15.108;LikelihoodRankSum=-0.021;MBQ=37,37;MFRL=225,238;MMQ=60,60;MPOS=45;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=1.251;SOR=1.580;Samples=HD786_4-1;TLOD=230.82 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:107,70:0.397:177:60,31:45,39:54,53,47,23

    4.1.9.0: 

    1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=299,290|183,155;AS_UNIQ_ALT_READ_COUNT=328;BaseQRankSum=1.715;ClippingRankSum=-0.613;DP=934;ECNT=2;FS=1.802;LikelihoodRankSum=0.052;MBQ=20,20;MFRL=153,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.636;SOR=0.837;Samples=HD786_4-1;TLOD=779.29 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:589,338:0.371:927:290,167:293,171:299,290,183,155

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Ce Wang

    Thanks for your patience! We had a long weekend here in the US. Could you send an excerpt of your input file around this locus so we can look at it more closely? Here are the instructions for how to submit a bug report:

    https://gatk.broadinstitute.org/hc/en-us/articles/360035889671

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ce Wang I also created a github ticket so that our developer team can determine if it is a bug: https://github.com/broadinstitute/gatk/issues/7285

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Hi Genevieve,

    I just found it was "--min-base-quality-score 25" in my command line that caused the problem. There is a default setting "--pcr-snv-qual 40" in Mutect2 acting as this role according to the Notes on Mutect2 (David Benjamin, et al., 2020): 

    So most base qualities were adjusted to 20 by default, which can not be calculated by AD as they are less than 25, although most base qualities in my data is higher than 30. I tried to run 4.2.0.0 with "--pcr-snv-qual 80", and got the similar AD to 4.1.9.0. 

    It shows that 4.1.9.0 may use origin base quality to calculate AD, but 4.2.0.0 use adjusted base quality to do this. I guess most users will not notice this. Would it be better to add some description or increase the default value of --pcr-snv-qual?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Ce Wang,

    Thanks for sharing your insights and looking into this further. The GATK 4.2.0.0 release was a big release and it looks like you may have identified some sort of bug that is embedded into the assembly engine. So far, we haven't been able to locate which of the changes between 4.1.9.0 and 4.2.0.0 would have caused this. Could you still upload the bug report from my previous comment so that we can trouble shoot the issue you found?

    In regards to a workaround for you now, we generally do not recommend users to change the --min-base-quality-score parameter because there are already built in read filters with GATK. Instead of changing the --pcr-snv-qual argument, I would recommend that you just use the default read filter settings and remove the --min-base-quality-score 25.

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ce Wang I'm also wondering how many sites have this change of the AD? Is it just a few, just this one, or many sites?

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Hi Genevieve, 

    Thanks for your recommendation. I also found another problem without changing the --pcr-snv-qual parameter. If I use default value 40, MBQ of most sites will be 20,20 for ref and alt respectively. Actually, MBQ is a useful base quality indicator for my downstream filtration, since all the MBQs of true positive sites will be high (>30) in my data. As MBQ represent median base quality, some false positive sites can be filtered based on relative low MBQ.

    In addition, for your last question, all of the sites in VCF have this change in AD.

    I will upload the bug report later. Thank you. 

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Hi Genevieve,

    I have uploaded the BAM file around this locus to ftp. The file name is issue7285_bam.tar.gz.

    Thank you.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Ce Wang,

    Thanks for sending that in, we are still trying to figure out which of the changes in GATK 4.2.0.0 would be responsible for the difference you are seeing.

    Would you be able to check and see what is the difference between 4.1.9.0 and 4.2.0.0 without using the --min-base-quality-score 25 argument? Are the outputs the same? What is the purpose in your analysis with the --min-base-quality-score 25 argument, since it's not an argument we generally recommend?

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Ce Wang

    Genevieve, the outputs are very similar between 4.1.9.0 and 4.2.0.0 without using the --min-base-quality-score 25 argument. 

    4.1.9.0: 

    1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=291,289|180,154;AS_UNIQ_ALT_READ_COUNT=310;BaseQRankSum=1.832;ClippingRankSum=-3.284;DP=921;ECNT=2;FS=1.819;LikelihoodRankSum=-1.307;MBQ=20,20;MFRL=153,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.726;SOR=0.852;Samples=HD786_4-1;TLOD=774.33 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:580,334:0.374:914:286,165:288,169:291,289,180,154

    4.2.0.0: 

    1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=296,299|182,159;AS_UNIQ_ALT_READ_COUNT=317;BaseQRankSum=1.693;ClippingRankSum=-3.099;DP=943;ECNT=2;FS=1.814;LikelihoodRankSum=-1.352;MBQ=20,20;MFRL=152,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.573;SOR=0.828;Samples=HD786_4-1;TLOD=791.41 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:595,341:0.370:936:294,168:295,173:296,299,182,159

    My purpose of using --min-base-quality-score 25 is to filter variants with low base quality in raw vcf. Actually, it's not very necessary. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ce Wang I think at this point I would recommend relying on the default filtering methods, as Mutect2 employs multiple read filters, which you can see at the tool docs page: https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2

    You can keep following along at the github ticket where the developers will look into this as they have the capacity. I'll post your update there, and feel free to comment any further questions about this issue there.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    @Ce Wang The change in 4.2.0.0 that had the effect you saw was this PR: https://github.com/broadinstitute/gatk/pull/6823.  This fixed a bug where various read transformations were not actually having any effect because they were being applied to a temporary copy of the reads and not the reads themselves.  Thus this PR made the `-min-base-quality-score` argument work as intended, which combined with the base quality reduction applied to overlapping reads triggered the severe truncation of most of your reads.

    As Genevieve said, the best solution is to use the default for -min-base-quality-score.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk