Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Wrong Calculation of DP, AD, AF

8 comments

  • Tiffany Miller

    Thanks for posting. I am looking into this now. 

  • Tiffany Miller

    Which file are you looking at in IGV? Can you take a look at Mutect2's bamout for comparison? What do those results look like? There are a few reasons why the DP in the VCF won't exactly match what you see in IGV:

    • the reads were downsampled because the coverage was too high
    • reads were filtered out due to low mapping quality
    • during realignment, reads that don't align well to the assembled haplotypes are dropped, and those reads are not counted in the DP
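    For debugging, downsampling can be switched off and a fresh bamout produced for side-by-side inspection in IGV. A sketch of such an invocation follows; the input file names are placeholders, and flag names should be verified against `gatk Mutect2 --help` for your GATK version:

    ```shell
    # Sketch: re-run Mutect2 with downsampling disabled and a bamout for IGV.
    # reference.fasta, tumor.bam, and the output names are placeholders.
    gatk Mutect2 \
      -R reference.fasta \
      -I tumor.bam \
      -O debug.vcf.gz \
      --bam-output debug_bamout.bam \
      --max-reads-per-alignment-start 0  # 0 disables downsampling at assembly
    ```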
  • Max Mustermann

    The IGV summary from the bamout at the same position is the following:

    Total count: 198
    A : 0
    C : 0
    G : 0
    T : 198 (100%, 52+, 146- )
    N : 0
    ---------------
    DEL: 36
    INS: 0
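    In naive pileup terms, the counts above work out as follows (illustrative arithmetic only; Mutect2 derives DP and AF from its assembled haplotypes and likelihood model, not from a raw pileup):

    ```python
    # Counts taken from the IGV summary above.
    t_count = 198    # reads showing T at this position
    del_count = 36   # reads with a deletion spanning this position

    # If the candidate event is the deletion, a naive pileup-style estimate
    # counts deletion-spanning reads in the denominator as well.
    dp = t_count + del_count
    af_del = del_count / dp
    print(f"DP={dp}, AF(DEL)={af_del:.3f}")  # DP=234, AF(DEL)=0.154
    ```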

    Is there a way to disable downsampling and the read filters for debugging purposes (--disable-tool-default-read-filters does not work)?

    My problem is the following: with an old VarScan-based pipeline, we get a higher DP and a very different AF. Now I need to ensure that Mutect2 works correctly and explain the change in DP to our project partners.

    Is there a Mutect2 mode that explains how and why Mutect2 drops reads?

  • David Benjamin

    Reads that are dropped early on by the read filters are reported in the tool's stdout.  These early filters sometimes explain the difference between Mutect2 and IGV.  One common example is that IGV might be configured to show duplicates, whereas Mutect2 filters them.  Or the mapping quality thresholds might be different. 

    Reads that are removed due to being a bad fit after assembly and realignment are usually egregious mapping artifacts; I wouldn't suspect them as the cause of any discrepancy between Mutect2 and IGV.

    Most of the difference tends to be explained by realignment, which reveals that reads that look like SNVs are actually indels and vice versa.  This is especially liable to occur near the ends of reads and in repetitive sequences.

    If you want, you could post IGV screenshots of the original bam and the bamout, making sure to include reads with the deletion and reads with the SNV.  For extra diagnostic power you can highlight equivalent reads (i.e. search by name) that support a different allele in the bamout than in the bam.

    However, if you want an explanation for your partners, you will quickly grow tired of trying to justify every discrepancy.  The general idea is that aligners map each read independently, making no assumptions about the number of local haplotypes.  With local assembly and realignment, Mutect2 is able to assign reads to a small set of local haplotypes, essentially performing a smart simultaneous alignment of all the reads in a region.  You will often, for example, see bams with a bunch of similar but different indels in an STR site where the Mutect2 bamout is much cleaner.
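    The haplotype-assignment idea can be illustrated with a toy similarity score (purely illustrative: Mutect2 actually scores read-to-haplotype likelihoods with a pair-HMM, and the sequences below are made up):

    ```python
    from difflib import SequenceMatcher

    def assign_to_haplotype(read, haplotypes):
        """Toy stand-in for read-to-haplotype assignment: pick the candidate
        haplotype with the highest string-similarity ratio to the read."""
        return max(haplotypes, key=lambda h: SequenceMatcher(None, read, h).ratio())

    ref_hap = "ACGTACGTTTACGT"  # hypothetical reference haplotype
    del_hap = "ACGTACGTACGT"    # same locus with a 2 bp deletion

    # An aligner mapping this read in isolation may force it onto the
    # reference with a mismatching tail; scored against both candidate
    # haplotypes, it matches the deletion haplotype cleanly.
    read = "GTACGTACG"
    best = assign_to_haplotype(read, [ref_hap, del_hap])
    print(best == del_hap)  # True
    ```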

  • Max Mustermann

    I do not want to explain every discrepancy, but an entirely different allele fraction (0.156 compared to 0.401 from the previous "trusted" VarScan-based pipeline) raises doubts about the new tool, Mutect2.

    I need more than "This is the new fundamentally better tool and the results from your previous pipeline are wrong."

    What would you do in this situation?

    EDIT: In both pipelines duplicate reads are removed from the bam files passed to Mutect2 or VarScan.

  • David Benjamin

    We're always happy to diagnose specific cases, which will require the bamout screenshots I mentioned.  The general theoretical grounds for trusting realigned reads are useful but I don't think they are a substitute for manually inspecting several cases to get a feel for it.

    I should also note that we have validated the superiority of realignment over naive pileup calling fairly thoroughly: https://www.biorxiv.org/content/10.1101/861054v1.  There are credible somatic variant callers other than Mutect2; in particular, I recommend you consider Strelka2 and Lancet (our preprint doesn't include them yet because we piggy-backed off the existing MC3 paper's comparisons). VarScan, however, is not one of them.

    Having said that, Mutect2 could certainly be wrong in this example.  The bamout screenshots will help.  If they don't yield an answer I can inspect a chunk of your bam file.

  • Max Mustermann

    So, currently Mutect2 does not have any option to output statistics about what happened to the reads from the original bam file that are not included in the bamout?

    Wouldn't that be a good enhancement to assist the investigation of specific cases?

  • David Benjamin

    That would make sense.  I will first need to take a closer look at the code for discarding reads.

