Slippage filter and multi-tool indel consensus calling
REQUIRED for all errors and issues:
a) GATK version used: 4.4.0.0
b) Exact command used: NA
c) Entire program log: NA
Hi GATK team,
I've recently been working on consensus indel calling using multiple variant callers (including Mutect2). I notice that essentially all variants that are PASSed by multiple non-Mutect2 variant callers are also identified by Mutect2, but are filtered by the "slippage" filter. These mutations are almost exclusively 1 bp indels in homopolymer contexts (i.e. COSMIC ID1 and ID2 patterns). I am trying to better understand the slippage filter, to work out whether our prospective consensus calling strategy is letting these variants into the callset when they shouldn't be there. I'm sure there is a very good rationale behind the implementation of the slippage filter in Mutect2/FilterMutectCalls. Would you be able to point me to any validation data for it? Being able to justify dropping the consensus strategy and simply using Mutect2 for indel calling would also simplify our workflows quite a bit.
-
Hi Luka Culibrk
We have quite thorough documentation on Mutect2 and FilterMutectCalls at the link below.
https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
In this document, the slippage filter is explained as follows.
For indels in short tandem repeats (STRs) FilterMutectCalls uses a simple model
for the possibility that alt reads are due to polymerase slippage. The prior
$\pi_L$ for a real variant of length 'L' comes from the allele fraction clustering
model. FilterMutectCalls assumes that polymerase slippage only occurs in STRs of
8 bases or more and only results in insertions or deletions of a single repeat unit.
The likelihood of 'a' alt reads out of 'd' total reads in the case of a real somatic
variant is given by the allele fraction clustering model. The likelihood in the case
of polymerase slippage is the marginal of binomial likelihoods over a slippage rate
with a uniform prior from 0 to 0.1, which is a regularized Beta function.
Given priors and likelihoods, the error probability follows.
There are 2 parameters set for this purpose under FilterMutectCalls:
--min-slippage-length <Integer>
Minimum number of reference bases in an STR to suspect polymerase slippage
Default value: 8.
and
--pcr-slippage-rate <Double>
The frequency of polymerase slippage in contexts where it is suspected
Default value: 0.1.
You should be able to adjust these values to tune the filter against a known truth set. Keep in mind, however, that some of the variants found in databases may themselves have been filtered in the original data, so whether to rely on the findings of other variant callers over Mutect2 is a risk you will have to weigh yourself.
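To make the slippage likelihood quoted above more concrete, here is a minimal Python sketch of "the marginal of binomial likelihoods over a slippage rate with a uniform prior from 0 to 0.1". It is not the GATK implementation. The function names are made up for illustration, and treating the upper bound of the prior as the --pcr-slippage-rate value is an assumption on my part. For integer read counts the regularized Beta function can be written as a binomial tail sum, so only the standard library is needed.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def slippage_likelihood(alt: int, depth: int, max_rate: float = 0.1) -> float:
    """Likelihood of `alt` alt reads out of `depth` reads under slippage.

    This is the binomial likelihood averaged over a slippage rate that is
    uniform on [0, max_rate]. max_rate = 0.1 mirrors the quoted prior; mapping
    it to --pcr-slippage-rate is an assumption, not taken from the GATK source.
    """
    # The integral of C(d, a) f^a (1 - f)^(d - a) over f in [0, x] equals
    # I_x(a + 1, d - a + 1) / (d + 1), where I_x is the regularized Beta function.
    # For integer arguments, I_x(a + 1, d - a + 1) = P(X >= a + 1)
    # with X ~ Binomial(d + 1, x).
    reg_beta = binomial_upper_tail(depth + 1, alt + 1, max_rate)
    # Divide by max_rate to normalize the uniform prior on [0, max_rate].
    return reg_beta / ((depth + 1) * max_rate)


if __name__ == "__main__":
    # Compare a low-fraction and a high-fraction indel at the same depth.
    for alt in (2, 30):
        print(alt, slippage_likelihood(alt, depth=100))
```

Under this model a site with 2 alt reads out of 100 is far more compatible with slippage than one with 30 out of 100, which fits the observation that low-allele-fraction 1 bp homopolymer indels are the ones that get filtered.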
To answer the question of whether we have any validation data for this particular filter: the short answer is no. The long answer is that the Mutect2 bioRxiv paper linked below reports the performance of Mutect2 and FilterMutectCalls for SNVs and indels, and may serve as a reference for both tools' performance metrics.
https://www.biorxiv.org/content/10.1101/861054v1.full.pdf
I hope this helps.
-
Hi Gökalp Çelik, thank you for the information. I had previously read the documentation on this; I was hoping more for details on the specific classes of indels that I'm concerned about here. In our case, it does appear that Mutect2 is capturing a more accurate picture of the biological context of our data than a consensus-intersect strategy. I was simply wondering whether there was more information to validate this specific filter, particularly regarding the false negatives it might introduce. In this regard, I believe you have answered my question, so thanks! Also, thank you for the preprint link.
-
Homopolymer errors are the major source of indel errors, and their use for assessing microsatellite instability is quite a delicate matter. Without a proper matched normal, it is almost impossible to tell whether there is really a variant there or whether what we observe is simply an error caused by PCR and/or the sequencing technology.
If you sequence germline samples with PCR-positive sample preparation, you will almost always observe an indel at quite low allele fraction in one or more of those homopolymer regions, so such calls carry less weight as true positives than other variants do.
One way to check whether such a variant is really present would be to run Mutect2 with matched tumor-normal data prepared in the same way.
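As a rough illustration of that matched tumor-normal run, the Python sketch below only assembles the two command lines and prints them; it does not execute GATK. All file names, the sample name, and the helper function are hypothetical placeholders. The slippage options are the two FilterMutectCalls parameters mentioned earlier in this thread, shown here at their defaults.

```python
import shlex


def mutect2_tumor_normal_cmds(ref, tumor_bam, normal_bam, normal_sample, prefix,
                              min_slippage_length=8, pcr_slippage_rate=0.1):
    """Build (but do not run) Mutect2 and FilterMutectCalls command lines."""
    unfiltered = f"{prefix}.unfiltered.vcf.gz"
    filtered = f"{prefix}.filtered.vcf.gz"
    # Calling step: tumor and matched normal BAMs; the normal is identified
    # by its read-group sample name via -normal.
    call = ["gatk", "Mutect2", "-R", ref, "-I", tumor_bam, "-I", normal_bam,
            "-normal", normal_sample, "-O", unfiltered]
    # Filtering step, with the slippage parameters spelled out explicitly.
    filt = ["gatk", "FilterMutectCalls", "-R", ref, "-V", unfiltered,
            "-O", filtered,
            "--min-slippage-length", str(min_slippage_length),
            "--pcr-slippage-rate", str(pcr_slippage_rate)]
    return call, filt


if __name__ == "__main__":
    # Placeholder paths and sample name, for illustration only.
    cmds = mutect2_tumor_normal_cmds("ref.fasta", "tumor.bam", "normal.bam",
                                     "NORMAL_SAMPLE", "case1")
    for cmd in cmds:
        print(shlex.join(cmd))
```

Printing the commands rather than running them makes it easy to review what would be submitted before handing it to a workflow manager.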
I hope this helps.