Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Why do a clear expected variant not show up in the Mutect2 vcf file

Answered
0

20 comments

  • Avatar
    Genevieve Brandt (she/her)

    Nabeel Ahmed can you comment on the github ticket with what you found? The developers are tracking it there.

    1
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Nabeel Ahmed,

    Thank you for including the very thorough post, this helps us to really figure out what might be going on here.

    One thing I have a question about, is that you show log files assembling an active region, which you say a region containing the variant, but it looks like it is off by 10,000 bp. You are looking for the variant at the site 25398285, whereas the active region showed contains 25380255-25380295. 

    Did you notice this, or did you mean to include a different section of the logs?

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Nabeel Ahmed

    Thanks for your reply Genevieve Brandt (she/her). I am sorry for the confusion. I am showing two  different sections of the log for comparison of two expected calls (one which is detected and the other which is not).  I am detecting an expected call at chr12:25380275 for which I have the multi-line log with EventMap, etc. 

    For the expected call at chr12:25398285 that was not detected, only the following line is in the logs.

    08:01:26.086 INFO  Mutect2Engine - Assembling chr12:25398242-25398320 with 14298 reads:    (with overlap region = chr12:25398142-25398420)

    I hope this clarifies your question.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    I see, thanks for clarifying, so for the variant chr12:25380275, it looks like there is no active region found there.

    And to confirm, did you try the --linked-de-bruijn-graph and --recover-all-dangling-branches options?

    0
    Comment actions Permalink
  • Avatar
    Nabeel Ahmed

    Yes. I have. There is no change in output.

    0
    Comment actions Permalink
  • Avatar
    Nabeel Ahmed

    Giving an update. I tried Mutect2 in a previous version of GATK 4.1.3.0 and I am seeing the expected call that is being missed above.
    The following is the call in the vcf output from Mutect2

    chr12 25398285 . C A . . DP=8160;ECNT=1;MBQ=36,36;MFRL=0,0;MMQ=42,42;MPOS=22;POPAF=7.30;TLOD=15.14 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:8007,22:2.526e-03:8029:3636,8:4305,13:4350,3657,14,8

    I am running the command below

    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.3.0-local.jar Mutect2 -R ../resources/hg19.fa -L ../resources/coding_regions.bed -I bam_files/sample1.bam --germline-resource ../resources/af-only-gnomad.raw.sites.hg19.vcf.gz --pon ../resources/pon.vcf.gz --max-reads-per-alignment-start 0 --genotype-pon-sites True --max-population-af 0.001 --f1r2-tar-gz vcf_files/f1r2.sample1.tar.gz -O vcf_files/unfiltered.sample1.vcf


    There was no option of --linked-de-bruijn-graph or --recover-all-dangling-branches in this version so I had not included these in the command above. 

    It is possible that any change in parameters from GATK4.1.3.0 to GATK4.2.0.0 maybe playing a role in that call not being made in the latest version?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Nabeel Ahmed,

    Sorry about the delay, I was out yesterday!

    My teammate gave this a look and saw that from the log you shared, it looks like an assembly failure in the region of the chr12:25380275 variant. We are a bit confused that an older version of GATK would result in this variant being called because we haven't made any changes that should have changed the assembly engine.

    Would you be able to run the --debug-graph-transformations option in this region for both GATK4.1.3.0 and GATK4.2.0.0? You will get a ton of .dot files and we would like look at these to see why they might be different. Upload these into a bug report, following these instructions.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Nabeel Ahmed

    Thank you for digging into this. I have followed the instructions and have uploaded a tar.gz with the name mutect2_nabeel_somatic_query_data_v2.tar.gz containing the dot files, log, input bam and output vcf for both GATK4.1.3.0 and GATK4.2.0.0. 

    Just a small correction in your comment. The assembly failure is happening in the region of chr12:25398285. The chr12:25380275 is a variant that is detected in both GATK4.1.3.0 and GATK4.2.0.0. I have given its example as a comparison.


    I hope a solution could be found for calling all missed calls.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you for submitting the report, we are looking into it.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Nabeel Ahmed we don't have a final solution to share yet, we are still looking into this.

    We did find the reason it is not being called and why you saw a change between GATK 4.1.3.0 and GATK 4.2.0.0. The older code didn't use adaptive pruning and looked at raw counts to evaluate edges in the assembly graph. Overall, we have seen improvements when using adaptive pruning, however, the reason it changed your results is because of the coverage for each allele at this site.

    In your assembly graph, the reference allele has very high coverage (7506) and the alternate allele has very low coverage (22). When using raw counts, the edge with 22 reads was not thrown out, but with adaptive pruning, this looks like an error. In your IGV image, it is hard to see this coverage discrepancy between the two alleles, so that is why we didn't catch this earlier.

    We are still looking into solutions. I will let you know when we have more information. Hopefully this helps you understand why you are seeing this!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Nabeel Ahmed,

    A correction to my earlier comment - the code from 4.1.3.0 did use adaptive pruning, but we have made some changes and improvements since then.

    Can you see if you are able to recover these variants by lowering the LOD score threshold? The argument is --pruning-lod-threshold. The default is 2.3, and you can lower this by 1 to 2. Try 1.3 to 0.5 to see if that helps with these sites.

    I have created a ticket on GATK's github page so that the developers can continue to look into this issue and determine if they need to make any updates in the algorithm. Here it is if you want to follow along: https://github.com/broadinstitute/gatk/issues/7232

    You can post any updates regarding the LOD score threshold to that ticket as well.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Nabeel Ahmed

    Hi Genevieve Brandt (she/her)

    Thank you for taking a look into this. I followed your recommendation of lowering the --pruning-lod-threshold but this call is still unable to make it to the output of Mutect2. I tried different thresholds from 1.3, 0.7, 0.5 and even 0.1 but it did not lead to any difference in detecting this call.

    0
    Comment actions Permalink
  • Avatar
    ISmolicz

    Dear Genevieve Brandt (she/her), Nabeel Ahmed and the GATK Team,

    Please could you advise if the issue described in the original post and comments was resolved? If so, please could you advise on the final solution? Unfortunately I cannot see any updates in the GitHub issue following the --pruning-lod-threshold update mentioned above.

    Thank you for your time and help.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    ISmolicz We have not resolved the issue.  Sometimes assembly-related problems like this can be resolved by using the -linked-de-bruijn-graph option.

    0
    Comment actions Permalink
  • Avatar
    ISmolicz

    Hi David Benjamin,

    Thank you for your reply to my post above.

    Please may I ask why using the --linked-de-bruijn-graph is not an argument that is set to true by default and is only recommended when troubleshooting?

    If there are any updates regarding the GitHub issue, it would be helpful if this could be indicated in this forum post.

    Thank you for your time and help.

    Kind regards.

     

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    No good reason.  We were hesitant when it was new and experimental, but it should be the default now.

    0
    Comment actions Permalink
  • Avatar
    ISmolicz

    Hi David Benjamin,

    Thank you for your reply and for the advice that --linked-de-bruijn-graph should be the default now. Just to mention in case this is helpful, Mutect2 documentation advises that this argument is set as false by default, even in the latest version of Mutect2. Therefore, it seems this must be applied by the user currently.

    Are there any other arguments in this troubleshooting document that should be set to default now when using Mutect2?

    Thank you for your time and help.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    The defaults are always a good idea.  Even linked de Bruijn graphs don't currently make much difference, though the next version of Mutect relies on them heavily.

    0
    Comment actions Permalink
  • Avatar
    ISmolicz

    Thank you for your reply David Benjamin.

    To clarify my above question, are there any other arguments in this troubleshooting document that should be set to true from the outset when using Mutect2 now considering that you have recommended that --linked-de-bruijn-graph should be set to true as default now (e.g. --recover-all-dangling-branches)?

    Kind regards.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    The GATK engine has so many adjustable parameters that we the developers don't know the exact optimal settings.  We use the defaults and recommend that users do the same unless they have an extremely good reason.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk