Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Increased number of false positives in Mutect2 GATK 4.1.7.0

15 comments

  • registered_user

    I found out using

    --ob-priors read-orientation-model.tar.gz

    with FilterMutectCalls makes a big difference for the better. I think this is a new feature that was introduced recently. It is starting to feel like the documentation is all over the place and finding these critical changes is becoming difficult. What else am I missing?
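
    For anyone who lands here later, the filtering step I mean looks roughly like this (file names are just placeholders on my side):

    gatk FilterMutectCalls \
        -R reference.fasta \
        -V unfiltered.vcf.gz \
        --ob-priors read-orientation-model.tar.gz \
        -O filtered.vcf.gz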

  • David Benjamin

    Public service announcement: the Mutect2 documentation is here: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf

     

    The read orientation model has been around for about 2 years, the recent change being that Mutect2 can collect the F1R2 counts file on the fly to save runtime rather than requiring the separate tool CollectF1R2Counts. I'm surprised that running read orientation filtering wasn't necessary with GATK 4.1.3 because it has always been essential for FFPE data and not very important otherwise.  Are your samples FFPE, or is there any other esoteric quality in their preparation that might cause substitution errors?
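
    Roughly, the current on-the-fly version of the workflow looks like this (file names are placeholders; in older versions the F1R2 counts came from running CollectF1R2Counts separately):

    gatk Mutect2 \
        -R reference.fasta \
        -I tumor.bam \
        --f1r2-tar-gz f1r2.tar.gz \
        -O unfiltered.vcf.gz

    gatk LearnReadOrientationModel \
        -I f1r2.tar.gz \
        -O read-orientation-model.tar.gz

    The resulting read-orientation-model.tar.gz is what FilterMutectCalls consumes via --ob-priors.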

    Off the top of my head I can't think of any change between 4.1.3 and 4.1.7 that would affect false positives very much.  To diagnose this it would help us a lot if you could share a VCF of new false positives in 4.1.7 and an accompanying VCF of the same sites in the output of FilterMutectCalls with 4.1.3, if they appear at all.

    It would also help if you share your commands for Mutect2, FilterMutectCalls, and any other tools such as CalculateContamination in your Mutect2 pipeline.

  • registered_user

    We are working with PAXgene fixed material.

    Maybe I misunderstood: I thought that using the Mutect2 --f1r2-tar-gz flag would add a tag to the raw results that would cause them to be filtered out by FilterMutectCalls without any extra parameters. I'm a bit confused now, because I think we saw a change in our 4.1.3.0 calls after enabling it (and filtering without --ob-priors).

    I am looking further into the issue and doing a more robust validation of the results. I'll let you know if we discover any clear issues with the calls. We did some tests with the contamination filtering and didn't see any significant changes, maybe we should try it one more time.
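
    (For reference, the contamination step follows the usual pattern, roughly the sketch below; file names and the common-variant resource are placeholders, and a matched normal can be supplied as a second pileup table.)

    gatk GetPileupSummaries \
        -I tumor.bam \
        -V common_biallelic.vcf.gz \
        -L common_biallelic.vcf.gz \
        -O tumor-pileups.table

    gatk CalculateContamination \
        -I tumor-pileups.table \
        -O contamination.table

    gatk FilterMutectCalls ... --contamination-table contamination.table ...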

    Do you recommend using FilterAlignmentArtifacts or any other additional steps?

  • registered_user

    Went back to check, and we did use --ob-priors in the 4.1.3.0 filtering; I had just forgotten about it.

  • David Benjamin

    If the read orientation artifact filter makes a big difference, it must mean that PAXgene introduces substitution errors somewhere between extracting the DNA and ligating adapters. When you include --ob-priors, are the 4.1.7 results comparable to 4.1.3, or still inferior?

    FilterAlignmentArtifacts can be a useful tool for removing mapping errors when you care more about removing false positives than about maximizing sensitivity.
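
    Roughly (file names below are placeholders, and the tool needs a BWA-MEM index image of the reference):

    gatk FilterAlignmentArtifacts \
        -R reference.fasta \
        -V filtered.vcf.gz \
        -I tumor.bam \
        --bwa-mem-index-image reference.img \
        -O realignment-filtered.vcf.gz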

  • registered_user

    Thanks for the reply David,

    The issue has reversed after applying --ob-priors. I think with 4.1.7.0 we are actually missing more calls now (false negatives, FNs). In the recent sample I was looking at, we got 14 FNs with 4.1.3.0 and now we're getting 28 with 4.1.7.0. They are all detected, but filtered out by FilterMutectCalls. It may not seem like a lot, but this comparison only covers the regions in our targeted deep sequencing panel; doubling the rate of FNs over the whole genome is more problematic. Here's a list of the 28 calls we are classifying as FNs (missed calls that should not be filtered out) and the reasons for filtering:

    chr1 4153237 . C T . strand_bias
    chr1 175148066 . T TA . slippage
    chr2 154188707 . CT C . slippage
    chr3 34053364 . C A . strand_bias
    chr3 77563487 . A AT . slippage
    chr3 117343832 . C CA . slippage
    chr5 2985099 . TAATTA T . slippage
    chr5 9057846 . C G . clustered_events;haplotype
    chr5 9057853 . C T . clustered_events;haplotype
    chr5 9057991 . C T . clustered_events
    chr5 141042984 . CT C . slippage
    chr6 142332231 . G A . strand_bias
    chr8 55063415 . A T . strand_bias
    chr8 55546358 . G T . strand_bias
    chr8 85085906 . AT A . slippage
    chr8 87582133 . G C . strand_bias
    chr8 112287186 . C A . strand_bias
    chr9 122480098 . CT C . slippage
    chr11 31525425 . A T . strand_bias;weak_evidence
    chr11 78666986 . G GT . slippage
    chr12 23546637 . G A . strand_bias
    chr15 54266045 . CT C . slippage
    chr15 54420688 . AT A . slippage
    chr15 57251736 . A T . weak_evidence
    chr18 35918470 . G A . strand_bias
    chr20 20897790 . A G . normal_artifact
    chrX 109271370 . TTC T . clustered_events

    "strand_bias" as the sole reason for filtering out an SNV occurs 420 times in total, so it is mostly doing a good job. I think we are seeing more "slippage" now in 4.1.7.0.

    For example, SNV chr12 23546637 is detected by our deepseq panel with 18 reads out of 331 supporting the call:

    chr12	23546637	.	G	A	0	PASS	DEPTH=331;RATIO=0.05;TCGA=0:0:313:18;

    There is really no reason to doubt it. In the WGS data that we are analyzing we see the variant in 6/64 reads. 5 of the 6 supporting reads are on the same strand, so I can understand why the filtering happens.

    Thinking about the possibility that 4.1.7.0 would actually perform better here and our ground truth data would be less accurate... I can't think of a way to prove it.

    Is there any way we can relax the filtering on calling slippages and strand bias?

     

  • registered_user

    I am experimenting with

    --pcr-slippage-rate

    to see how it affects the results.
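
    i.e. something along these lines (the 0.05 is arbitrary, just to see what changes):

    gatk FilterMutectCalls \
        -R reference.fasta \
        -V unfiltered.vcf.gz \
        --ob-priors read-orientation-model.tar.gz \
        --pcr-slippage-rate 0.05 \
        -O filtered.vcf.gz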

    About the orientation bias filtering... I would like to see how the results look if I could specify that a single supporting read on the other strand is enough to keep the variant from getting filtered with "strand_bias", at least for low-AF variants. Is this possible?

  • registered_user

    I'm noticing the AS_SB_TABLE contains the information I'm interested in.

    AS_FilterStatus=strand_bias;AS_SB_TABLE=67,35|1,4;

    where "1,4" tells me the split between forward and reverse reads for the alt allele. I think I'm going to apply my own post-filtering step that sets variants with up to six or seven supporting reads to PASS, as long as at least one supporting read is on the other strand.
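
    Something like the sketch below is what I have in mind (not an official GATK step; it assumes a single alt allele per line and only rescues sites whose sole failing filter is strand_bias):

    awk 'BEGIN{OFS="\t"}
    /^#/ {print; next}
    {
        # Rescue strand_bias-only sites with <= 7 alt reads and at least one read on each strand.
        if ($7 == "strand_bias" && match($8, /AS_SB_TABLE=[^;]*/)) {
            split(substr($8, RSTART+12, RLENGTH-12), sides, "[|]")  # "ref_fwd,ref_rev|alt_fwd,alt_rev"
            split(sides[2], alt, ",")                               # alt-allele forward,reverse counts
            if (alt[1] + alt[2] <= 7 && alt[1] >= 1 && alt[2] >= 1) {
                $7 = "PASS"
            }
        }
        print
    }' filtered.vcf > rescued.vcf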

    Update: Lowering the pcr-slippage-rate and introducing the new post-filtering step gets us back up to matching the performance of 4.1.3.0. Maybe even a very slight improvement.

  • David Benjamin

    I think our strand bias filter is seriously wrong here, because in our validations we have found that strand bias artifacts (which are not the same as orientation bias artifacts) are very rare in modern sequencing.  I think your post-filtering is a fine band-aid for the problem and in fact you could go a step further and just ignore the strand bias filter entirely.  It will incur at most a few false positives.  We need to fix this.

    The slippage filter is more problematic because polymerase slippage occurs in all Illumina sequencing (in fact, in all sequencing involving a polymerase), so your ground truth is not an orthogonal validation in this regard. We set up a nifty validation (https://www.biorxiv.org/content/10.1101/825042v1.full) and found that our slippage filter is frequently right and frequently wrong. Relaxing it is risky, but the filter definitely causes a good number of false negatives. We have figured out how to refine the slippage filter to improve things dramatically (for example, insertion slippage errors are much rarer in Illumina sequencing than deletion errors) but have not yet implemented this analysis.

    If you are able to share your unfiltered Mutect2 VCF with us for debugging I'm sure I could find out why the strand bias filter is going off the rails and fix it pretty quickly.

  • registered_user

    @David_Benjamin,

    I sent you an email regarding our data, we would be happy for you to take a look at it.

  • David Benjamin

    [We're discussing this off-forum, but I will post here once I have an answer on the strand bias issue.]

  • Seung-been Steven Lee

    David Benjamin and @registered_user,

    Are there any updates regarding this? I've been running Mutect2 and want to better understand each of the commonly seen filters (e.g. slippage, orientation). FYI, I'm using the latest version of GATK, 4.2.0.

  • Pamela Bretscher

    Hi Seung-been Steven Lee,

    I spoke with David Benjamin and I believe he was unable to find anything obviously wrong with the user's results. More information on the Mutect2 filters and parameters can be found in the Tool Documentation, the Mutect2 FAQ, as well as this paper.

    Kind regards,

    Pamela

  • Rajesh Kumar Maurya

    I'm using GATK 4.2.4.0 with GRCh38 and getting plenty of false calls per exome.

    Is there a problem in HaplotypeCaller? Please let me know.

  • Genevieve Brandt (she/her)

    Rajesh Kumar Maurya, thank you for your question! The advice in this thread generally applies to somatic calls; in your case, it sounds like you are doing germline analysis.

    There are no general HaplotypeCaller bugs that would cause an excess of false positives. HaplotypeCaller is written to flag any possible sign of variation, and we expect users to perform variant filtering on its raw output. Here is an article on the variant filtering step: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering
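
    For example, a minimal hard-filtering sketch in the spirit of that article (thresholds and file names are illustrative only; in practice you would usually separate SNPs and indels first and apply the recommended thresholds for each):

    gatk VariantFiltration \
        -R reference.fasta \
        -V haplotypecaller.vcf.gz \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "FS > 60.0" --filter-name "FS60" \
        --filter-expression "MQ < 40.0" --filter-name "MQ40" \
        -O hard-filtered.vcf.gz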

    If you have any more questions regarding HaplotypeCaller, feel free to open up a new thread and we can help you from there!

