The depth of SNV/InDels calculated by Mutect2 in GATK4.2.0.0 is much lower than real sequencing depth, why?
AnsweredHi GATK team,
GATK version: 4.2.0.0
Sample: Target region sequencing, human cancer, everage depth 1000X
I'm using Mutect2 in GATK4.2.0.0 to call somatic SNV/InDels. What confused me is that I found the depth of each location emitted from "AD" field in vcf is much lower than the real depth. I think I have disabled downsampling by set "--max-reads-per-alignment-start" to 0. The command line I used is as follow:
gatk Mutect2 -R reference.fa -I tumor.bam --panel-of-normals pon.vcf.gz -L target.bed -O sample.snvIndels.vcf --callable-depth 30 --f1r2-tar-gz sample.f1r2.tar.gz --min-base-quality-score 25 --max-reads-per-alignment-start 0 --minimum-allele-fraction 0.002 --dont-use-soft-clipped-bases --force-active --mitochondria-mode --enable-all-annotations
For example, a mutated point information in vcf called by GATK4.2.0.0-Mutect2 is:
1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=51,50|46,23;AS_UNIQ_ALT_READ_COUNT=69;BaseQRankSum=0.561;ClippingRankSum=-1.473;DP=179;ECNT=2;FS=13.849;LikelihoodRankSum=-0.392;MBQ=37,37;MFRL=236,239;MMQ=60,60;MPOS=44;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=1.335;SOR=1.545;Samples=HD786_4-1;TLOD=230.09 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:101,69:0.408:170:54,30:45,39:51,50,46,23
But the information of the same point in vcf called by GATK4.1.9.0-Mutect2 using the same command is:
1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=299,290|183,155;AS_UNIQ_ALT_READ_COUNT=328;BaseQRankSum=1.715;ClippingRankSum=-0.613;DP=934;ECNT=2;FS=1.802;LikelihoodRankSum=0.052;MBQ=20,20;MFRL=153,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.636;SOR=0.837;Samples=HD786_4-1;TLOD=779.29 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:589,338:0.371:927:290,167:293,171:299,290,183,155
This "580+338" is exactly the true depth. Is there any other downsampling or filter in GATK4.2.0.0-Mutect2 but not in GATK4.1.9.0-Mutect2?
-
Hi Ce Wang,
We fixed the option --dont-use-soft-clipped-bases in GATK 4.2.0.0 to actually work as intended, there was a bug with that option in GATK 4.1.9.0.
The AD in 4.2.0.0 should be correct after taking into account soft clipped bases. When you say that 4.1.9.0 is concordant with the "true depth", which true depth is this? You can look on IGV to see what is going on here, if there are soft clipped bases at this location, then that explains the difference.
Let me know if this is not occurring.
Best,
Genevieve
-
Hi Genevieve,
Thanks to your advice. But I think soft clipped base is not the case, because the ratio of soft clipped reads in my data is very low. The picture below shows the feature of the site I mentioned before in IGV.
The depth of this site is 944 in IGV, which is concordant with the AD in 4.1.9.0:
So I think there exists other reason to explain the difference.
Best,
Ce Wang
-
Ce Wang I see, thanks for sharing those. Could you compare the IGV depth from the -bamout using these two GATK versions? Are there differences in the realigned reads?
Here is a helpful article that describes what to look for when troubleshooting certain sites: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
-
The two bam files generated by --bam-output have the difference in depth. Each one is concordant with corresponding AD. The two pictures below show the same location in 4.1.9.0-bamout and 4.2.0.0-bamout respectively.
4.1.9.0:
4.2.0.0:
All of the target regions have the difference in depth like this. So the size of bam file output by 4.2.0.0 is much smaller than that of 4.1.9.0.
-
Ce Wang Could you zoom out and compare these two bams to see what the reads look like at this location? Mutect2 realigns the reads so there is a possibility that after realignment there were more soft clipped bases at this location.
Here is a description of the local re-assembly and haplotype determination: https://gatk.broadinstitute.org/hc/en-us/articles/360036227612-Local-re-assembly-and-haplotype-determination-HaplotypeCaller-and-Mutect2-
-
Genevieve, there is no soft clipped read at this location in both bams.
4.1.9.0:
4.2.0.0:
-
I see, I want to look into whether this difference is from the --dont-use-soft-clipped-bases argument or the GATK assembly engine.
Could you run this same comparison but without the --dont-use-soft-clipped-bases argument and post the results?
-
Genevieve, I run without --dont-use-soft-clipped-bases as you said, but got almost the same result as before. The difference still exists.
4.2.0.0:
1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=54,53|47,23;AS_UNIQ_ALT_READ_COUNT=70;BaseQRankSum=1.299;ClippingRankSum=-1.262;DP=186;ECNT=2;FS=15.108;LikelihoodRankSum=-0.021;MBQ=37,37;MFRL=225,238;MMQ=60,60;MPOS=45;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=1.251;SOR=1.580;Samples=HD786_4-1;TLOD=230.82 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:107,70:0.397:177:60,31:45,39:54,53,47,23
4.1.9.0:
1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=299,290|183,155;AS_UNIQ_ALT_READ_COUNT=328;BaseQRankSum=1.715;ClippingRankSum=-0.613;DP=934;ECNT=2;FS=1.802;LikelihoodRankSum=0.052;MBQ=20,20;MFRL=153,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.636;SOR=0.837;Samples=HD786_4-1;TLOD=779.29 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:589,338:0.371:927:290,167:293,171:299,290,183,155
-
Hi Ce Wang,
Thanks for your patience! We had a long weekend here in the US. Could you send an excerpt of your input file around this locus so we can look at it more closely? Here are the instructions for how to submit a bug report:
https://gatk.broadinstitute.org/hc/en-us/articles/360035889671
Thank you,
Genevieve
-
Ce Wang I also created a github ticket so that our developer team can determine if it is a bug: https://github.com/broadinstitute/gatk/issues/7285
-
Hi Genevieve,
I just found it was "--min-base-quality-score 25" in my command line that caused the problem. There is a default setting "--pcr-snv-qual 40" in Mutect2 acting as this role according to the Notes on Mutect2 (David Benjamin, et al., 2020):
So most base qualities were adjusted to 20 by default, which can not be calculated by AD as they are less than 25, although most base qualities in my data is higher than 30. I tried to run 4.2.0.0 with "--pcr-snv-qual 80", and got the similar AD to 4.1.9.0.
It shows that 4.1.9.0 may use origin base quality to calculate AD, but 4.2.0.0 use adjusted base quality to do this. I guess most users will not notice this. Would it be better to add some description or increase the default value of --pcr-snv-qual?
-
Hi Ce Wang,
Thanks for sharing your insights and looking into this further. The GATK 4.2.0.0 release was a big release and it looks like you may have identified some sort of bug that is embedded into the assembly engine. So far, we haven't been able to locate which of the changes between 4.1.9.0 and 4.2.0.0 would have caused this. Could you still upload the bug report from my previous comment so that we can trouble shoot the issue you found?
In regards to a workaround for you now, we generally do not recommend users to change the --min-base-quality-score parameter because there are already built in read filters with GATK. Instead of changing the --pcr-snv-qual argument, I would recommend that you just use the default read filter settings and remove the --min-base-quality-score 25.
Thank you,
Genevieve
-
Ce Wang I'm also wondering how many sites have this change of the AD? Is it just a few, just this one, or many sites?
-
Hi Genevieve,
Thanks for your recommendation. I also found another problem without changing the --pcr-snv-qual parameter. If I use default value 40, MBQ of most sites will be 20,20 for ref and alt respectively. Actually, MBQ is a useful base quality indicator for my downstream filtration, since all the MBQs of true positive sites will be high (>30) in my data. As MBQ represent median base quality, some false positive sites can be filtered based on relative low MBQ.
In addition, for your last question, all of the sites in VCF have this change in AD.
I will upload the bug report later. Thank you.
-
Hi Genevieve,
I have uploaded the BAM file around this locus to ftp. The file name is issue7285_bam.tar.gz.
Thank you.
-
Hi Ce Wang,
Thanks for sending that in, we are still trying to figure out which of the changes in GATK 4.2.0.0 would be responsible for the difference you are seeing.
Would you be able to check and see what is the difference between 4.1.9.0 and 4.2.0.0 without using the --min-base-quality-score 25 argument? Are the outputs the same? What is the purpose in your analysis with the --min-base-quality-score 25 argument, since it's not an argument we generally recommend?
Best,
Genevieve
-
Genevieve, the outputs are very similar between 4.1.9.0 and 4.2.0.0 without using the --min-base-quality-score 25 argument.
4.1.9.0:
1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=291,289|180,154;AS_UNIQ_ALT_READ_COUNT=310;BaseQRankSum=1.832;ClippingRankSum=-3.284;DP=921;ECNT=2;FS=1.819;LikelihoodRankSum=-1.307;MBQ=20,20;MFRL=153,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.726;SOR=0.852;Samples=HD786_4-1;TLOD=774.33 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:580,334:0.374:914:286,165:288,169:291,289,180,154
4.2.0.0:
1 24868045 . A G . . AC=1;AF=0.500;AN=2;AS_MQ=60.00;AS_SB_TABLE=296,299|182,159;AS_UNIQ_ALT_READ_COUNT=317;BaseQRankSum=1.693;ClippingRankSum=-3.099;DP=943;ECNT=2;FS=1.814;LikelihoodRankSum=-1.352;MBQ=20,20;MFRL=152,145;MMQ=60,60;MPOS=39;MQ=60.00;MQ0=0;MQRankSum=0.000;NCC=0;NCount=0;OCM=0;PON;POPAF=7.30;REF_BASES=GCTCAGCAGAACAGACCCAGA;ReadPosRankSum=3.573;SOR=0.828;Samples=HD786_4-1;TLOD=791.41 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:595,341:0.370:936:294,168:295,173:296,299,182,159
My purpose of using --min-base-quality-score 25 is to filter variants with low base quality in raw vcf. Actually, it's not very necessary.
-
Ce Wang I think at this point I would recommend relying on the default filtering methods, as Mutect2 employs multiple read filters, which you can see at the tool docs page: https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2
You can keep following along at the github ticket where the developers will look into this as they have the capacity. I'll post your update there, and feel free to comment any further questions about this issue there.
-
@Ce Wang The change in 4.2.0.0 that had the effect you saw was this PR: https://github.com/broadinstitute/gatk/pull/6823. This fixed a bug where various read transformations were not actually having any effect because they were being applied to a temporary copy of the reads and not the reads themselves. Thus this PR made the `-min-base-quality-score` argument work as intended, which combined with the base quality reduction applied to overlapping reads triggered the severe truncation of most of your reads.
As Genevieve said, the best solution is to use the default for -min-base-quality-score.
Please sign in to leave a comment.
19 comments