Why do a clear expected variant not show up in the Mutect2 vcf file
AnsweredI am running Mutect2 on a sample in tumor-only mode. This sample has mutations introduced and known to be true positive calls. However, I am unable to detect some of these calls in the vcf file after Mutect2 is run that have very clear read support as seen in IGV. I have used the –bam-output option to show the output bam and in IGV, it shows that there is no assembly in this region and no mutation event was detected. I am showing the IGV screenshot for one of such calls (chr12:25398285).
I am using the latest version GATK 4.2.0.0 and the following is the full Mutect2 command from the log file
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.2.0.0-local.jar Mutect2 -R ../resources/hg19.fa -L ../resources/coding_regions.bed -I bam_files/sample1.bam --pon ../resources/pon.vcf.gz --germline-resource ../resources/af-only-gnomad.raw.sites.hg19.vcf.gz --bam-output sample1.mutect2_out.bam --recover-all-dangling-branches true -min-pruning 1 --min-dangling-branch-length 2 --debug --max-reads-per-alignment-start 0 --genotype-pon-sites True --f1r2-tar-gz vcf_files/f1r2.sample1.tar.gz -O vcf_files/unfiltered.sample1.vcf
In the debug mode, the following log messages are generated for this region
08:01:26.086 INFO Mutect2Engine - Assembling chr12:25398242-25398320 with 14298 reads: (with overlap region = chr12:25398142-25398420)
I have another call with similar VAF that is detected in the vcf output(chr12:25380275).
chr12 25380275 . T G . . AS_SB_TABLE=3911,5343|26,21;DP=9485;ECNT=1;MBQ=36,36;MFRL=0,0;MMQ=42,42;MPOS=18;POPAF=7.30;TLOD=53.53 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:9254,47:4.970e-03:9301:5321,21:3867,26:3911,5343,26,21
The input and the output BAMs show this call with the variant.
In the logs, it shows the detection of an active region here:
08:01:23.642 INFO Mutect2Engine - Assembling chr12:25380238-25380327 with 19912 reads: (with overlap region = chr12:25380138-25380427)
08:01:24.119 INFO EventMap - >> Events = EventMap{chr12:25380275-25380275 [T*, G],}
08:01:24.154 INFO AssemblyResultSet - Trimming active region AssemblyRegion chr12:25380238-25380327 active?=true nReads=19912 with 2 haplotypes
08:01:24.154 INFO AssemblyResultSet - Trimmed region to chr12:25380255-25380295 and reduced number of haplotypes from 2 to only 2
08:01:25.383 INFO EventMap - >> Events = EventMap{chr12:25380275-25380275 [T*, G],}
I have tried troubleshooting with the steps stated in this blog. However, it does not change the output vcf. I used the force-calling mode by giving the above call in an input vcf and the call did appear in the vcf file.
chr12 25398285 . C A . . AS_SB_TABLE=4312,3630|14,8;DP=8096;ECNT=1;MBQ=36,36;MFRL=0,0;MMQ=42,42;MPOS=22;POPAF=7.30;TLOD=14.69 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:7942,22:2.576e-03:7964:3609,8:4268,13:4312,3630,14,8
However, I cannot rely on force-calling mutations on a set of calls. I am unsure if I am missing out more calls that are not showing up. Are there any parameters I need to tune so that I do not miss calls like above?
-
Nabeel Ahmed can you comment on the github ticket with what you found? The developers are tracking it there.
-
Hi Nabeel Ahmed,
Thank you for including the very thorough post, this helps us to really figure out what might be going on here.
One thing I have a question about, is that you show log files assembling an active region, which you say a region containing the variant, but it looks like it is off by 10,000 bp. You are looking for the variant at the site 25398285, whereas the active region showed contains 25380255-25380295.
Did you notice this, or did you mean to include a different section of the logs?
Thank you,
Genevieve
-
Thanks for your reply Genevieve Brandt (she/her). I am sorry for the confusion. I am showing two different sections of the log for comparison of two expected calls (one which is detected and the other which is not). I am detecting an expected call at chr12:25380275 for which I have the multi-line log with EventMap, etc.
For the expected call at chr12:25398285 that was not detected, only the following line is in the logs.
08:01:26.086 INFO Mutect2Engine - Assembling chr12:25398242-25398320 with 14298 reads: (with overlap region = chr12:25398142-25398420)
I hope this clarifies your question.
-
I see, thanks for clarifying, so for the variant chr12:25380275, it looks like there is no active region found there.
And to confirm, did you try the --linked-de-bruijn-graph and --recover-all-dangling-branches options?
-
Yes. I have. There is no change in output.
-
Giving an update. I tried Mutect2 in a previous version of GATK 4.1.3.0 and I am seeing the expected call that is being missed above.
The following is the call in the vcf output from Mutect2chr12 25398285 . C A . . DP=8160;ECNT=1;MBQ=36,36;MFRL=0,0;MMQ=42,42;MPOS=22;POPAF=7.30;TLOD=15.14 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:8007,22:2.526e-03:8029:3636,8:4305,13:4350,3657,14,8
I am running the command below
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.3.0-local.jar Mutect2 -R ../resources/hg19.fa -L ../resources/coding_regions.bed -I bam_files/sample1.bam --germline-resource ../resources/af-only-gnomad.raw.sites.hg19.vcf.gz --pon ../resources/pon.vcf.gz --max-reads-per-alignment-start 0 --genotype-pon-sites True --max-population-af 0.001 --f1r2-tar-gz vcf_files/f1r2.sample1.tar.gz -O vcf_files/unfiltered.sample1.vcf
There was no option of --linked-de-bruijn-graph or --recover-all-dangling-branches in this version so I had not included these in the command above.
It is possible that any change in parameters from GATK4.1.3.0 to GATK4.2.0.0 maybe playing a role in that call not being made in the latest version? -
Hi Nabeel Ahmed,
Sorry about the delay, I was out yesterday!
My teammate gave this a look and saw that from the log you shared, it looks like an assembly failure in the region of the chr12:25380275 variant. We are a bit confused that an older version of GATK would result in this variant being called because we haven't made any changes that should have changed the assembly engine.
Would you be able to run the --debug-graph-transformations option in this region for both GATK4.1.3.0 and GATK4.2.0.0? You will get a ton of .dot files and we would like look at these to see why they might be different. Upload these into a bug report, following these instructions.
Best,
Genevieve
-
Thank you for digging into this. I have followed the instructions and have uploaded a tar.gz with the name mutect2_nabeel_somatic_query_data_v2.tar.gz containing the dot files, log, input bam and output vcf for both GATK4.1.3.0 and GATK4.2.0.0.
Just a small correction in your comment. The assembly failure is happening in the region of chr12:25398285. The chr12:25380275 is a variant that is detected in both GATK4.1.3.0 and GATK4.2.0.0. I have given its example as a comparison.
I hope a solution could be found for calling all missed calls. -
Thank you for submitting the report, we are looking into it.
-
Nabeel Ahmed we don't have a final solution to share yet, we are still looking into this.
We did find the reason it is not being called and why you saw a change between GATK 4.1.3.0 and GATK 4.2.0.0. The older code didn't use adaptive pruning and looked at raw counts to evaluate edges in the assembly graph. Overall, we have seen improvements when using adaptive pruning, however, the reason it changed your results is because of the coverage for each allele at this site.
In your assembly graph, the reference allele has very high coverage (7506) and the alternate allele has very low coverage (22). When using raw counts, the edge with 22 reads was not thrown out, but with adaptive pruning, this looks like an error. In your IGV image, it is hard to see this coverage discrepancy between the two alleles, so that is why we didn't catch this earlier.
We are still looking into solutions. I will let you know when we have more information. Hopefully this helps you understand why you are seeing this!
Best,
Genevieve
-
Hi Nabeel Ahmed,
A correction to my earlier comment - the code from 4.1.3.0 did use adaptive pruning, but we have made some changes and improvements since then.
Can you see if you are able to recover these variants by lowering the LOD score threshold? The argument is --pruning-lod-threshold. The default is 2.3, and you can lower this by 1 to 2. Try 1.3 to 0.5 to see if that helps with these sites.
I have created a ticket on GATK's github page so that the developers can continue to look into this issue and determine if they need to make any updates in the algorithm. Here it is if you want to follow along: https://github.com/broadinstitute/gatk/issues/7232
You can post any updates regarding the LOD score threshold to that ticket as well.
Best,
Genevieve
-
Hi Genevieve Brandt (she/her),
Thank you for taking a look into this. I followed your recommendation of lowering the --pruning-lod-threshold but this call is still unable to make it to the output of Mutect2. I tried different thresholds from 1.3, 0.7, 0.5 and even 0.1 but it did not lead to any difference in detecting this call. -
Dear Genevieve Brandt (she/her), Nabeel Ahmed and the GATK Team,
Please could you advise if the issue described in the original post and comments was resolved? If so, please could you advise on the final solution? Unfortunately I cannot see any updates in the GitHub issue following the --pruning-lod-threshold update mentioned above.
Thank you for your time and help.
-
ISmolicz We have not resolved the issue. Sometimes assembly-related problems like this can be resolved by using the -linked-de-bruijn-graph option.
-
Hi David Benjamin,
Thank you for your reply to my post above.
Please may I ask why using the --linked-de-bruijn-graph is not an argument that is set to true by default and is only recommended when troubleshooting?
If there are any updates regarding the GitHub issue, it would be helpful if this could be indicated in this forum post.
Thank you for your time and help.
Kind regards.
-
No good reason. We were hesitant when it was new and experimental, but it should be the default now.
-
Hi David Benjamin,
Thank you for your reply and for the advice that --linked-de-bruijn-graph should be the default now. Just to mention in case this is helpful, Mutect2 documentation advises that this argument is set as false by default, even in the latest version of Mutect2. Therefore, it seems this must be applied by the user currently.
Are there any other arguments in this troubleshooting document that should be set to default now when using Mutect2?
Thank you for your time and help.
-
The defaults are always a good idea. Even linked de Bruijn graphs don't currently make much difference, though the next version of Mutect relies on them heavily.
-
Thank you for your reply David Benjamin.
To clarify my above question, are there any other arguments in this troubleshooting document that should be set to true from the outset when using Mutect2 now considering that you have recommended that --linked-de-bruijn-graph should be set to true as default now (e.g. --recover-all-dangling-branches)?
Kind regards.
-
The GATK engine has so many adjustable parameters that we the developers don't know the exact optimal settings. We use the defaults and recommend that users do the same unless they have an extremely good reason.
Please sign in to leave a comment.
20 comments