Germline sites in Mutect2
Hi, I have a few technical questions about possible germline sites in 'normal mosaic' sites in Mutect2 (4.2.0).
1) What does the 'germline' filter in the final Mutect VCF mean exactly? It isn't clear from the documentation if these are only SNVs that were found in the --germline-resource, or whether these also represent variants found in the matched normal?
2) Are all variants that mutect calls in both the tumor and the normal samples present in the final VCF? Or is there an initial filtering step internally in Mutect that calls likely somatic variants using information from the tumor and normal samples, and therefore the final VCF does not contain all germline sites? In other words, are true germline sites found in both the normal sample and the tumor in the final Mutect VCF? I'm assuming it must be the latter because the number of variants in the final Mutect VCF seems too small to include all germline variants, but if that is the case, then what exactly are the sites labeled as 'germline' in the final VCF?
3) Mutect2's documentation has an error. The --genotype-germline-sites and --genotype-pon-sites have the same documentation: "Usually we exclude sites in the panel of normals from active region determination, which saves time. Setting this to true causes Mutect to produce a variant call at these sites. This call will still be filtered, but it shows up in the vcf. "
What is the difference between them, and what does genotype-germline-sites do?
4) Do you have a suggestion for how to detect variants that are both in a tumor sample and a normal sample, but the variant is at low level in the normal sample such that it might be a true normal somatic variant? This is similar to tumor-in-normal contamination but not quite, because it will only be a very small subset of tumor variants that have this property, whereas with tumor-in-normal contamination, many tumor variants will be found in the normal sample.
Thanks!
-
Hi,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Hi G E,
The Germline filter is described in depth in the Mutect2 paper, which can be found here: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
The section you are looking for is 2) Filtering E) Germline Filter on page 7.
For your question 3), I don't see a bug there; the two arguments have different descriptions. --genotype-germline-sites is experimental and calls all apparent germline sites even though they will ultimately be filtered, whereas --genotype-pon-sites calls sites in the PoN even though they will ultimately be filtered.
Best,
Genevieve
-
Were any of your questions not addressed in that paper, G E?
-
Not completely for question (2). There seem to be two steps where Mutect2 can filter germline variants: once while Mutect2 is running, in which case those variants never show up at all in the final VCF, and then a second time, in which case those variants show up labeled as 'germline' in the FILTER column of the final VCF. The difference between these steps and the algorithm used for each are not clear to me, and I don't think they are described explicitly in the paper.
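To see how many emitted records fall into the second category, one option is to tally the FILTER column of the final VCF. This is only a minimal sketch in plain Python: the records below are made up for illustration, and the filter labels `germline` and `panel_of_normals` are assumed to match what FilterMutectCalls writes in your version. Sites dropped before assembly never appear in the VCF, so they cannot be counted this way.

```python
from collections import Counter

def tally_filters(vcf_lines):
    """Count how often each FILTER label appears among emitted VCF records."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) is FILTER; multiple labels are ';'-separated.
        for label in fields[6].split(";"):
            counts[label] += 1
    return counts

# Hypothetical records for illustration only, not real calls.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t250\t.\tG\tC\t.\tgermline\t.",
    "chr2\t300\t.\tC\tT\t.\tgermline;panel_of_normals\t.",
]
print(tally_filters(example))
```

For a real file, pass an open (uncompressed) VCF file handle instead of the `example` list.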
-
Question 2 - Mutect2 tries to get rid of germline sites before running the assembly graph algorithm in order to save computational time. The sites that are left and are labeled as germline in the FILTER column are either near a tumor site or are unclear before reassembly.
Question 4 - You can set any sample in which you want to find somatic variants as the tumor sample and another tissue as the normal sample. You definitely need a normal sample, however, or you will not be able to get good results.
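As a rough illustration of that setup, the sketch below only assembles a Mutect2 command line; it does not run anything. All file names and the sample name are hypothetical placeholders, and it assumes the usual convention that `-normal` takes the sample name (the SM tag of the BAM read group) rather than a file path.

```python
import shlex

def mutect2_command(reference, case_bam, control_bam, control_sample_name, output_vcf):
    """Build (but do not run) a Mutect2 call with one sample marked as the normal."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", case_bam,                   # tissue in which somatic variants are sought
        "-I", control_bam,                # tissue used as the matched normal
        "-normal", control_sample_name,   # assumed to be the SM tag of the control BAM
        "-O", output_vcf,
    ]

# Hypothetical inputs for illustration only.
cmd = mutect2_command(
    "ref.fasta",
    "tissue_of_interest.bam",
    "matched_control.bam",
    "CONTROL_SM",
    "calls.unfiltered.vcf.gz",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.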
-
Thank you! Regarding Question 2: what are the criteria used for filtering germline sites before running the assembly graph? The algorithm for that is not clear from the Mutect2 PDF documentation.
-
Here is one of the Mutect2 devs responding to a very similar question: https://gatk.broadinstitute.org/hc/en-us/community/posts/360075855412/comments/360014567952
The PON is used for this filtering. There is also more information in the Mutect FAQ in question 15.
-
Is there any way to set a threshold VAF for germline and PON sites in the normal sample below which they are genotyped by Mutect2?
This is an important issue, because true low-level mosaic variants in normal samples will cause those sites to be systematically filtered out in the tumor by the current approach, even though these variants are not germline variants.
So for example, an early-occurring developmental driver mutation may be in the normal sample at low VAF, and then filtered by either the PON or germline filter in the final tumor VCF calls. However, this is a true mosaic variant in the normal sample that should still show up in the final VCF.
-
There is no current method to change the germline and PON filtering thresholds. The PON filtering is a hard filter. We haven't found it to be the case that a somatic driver site would be filtered out by the PON because it wouldn't be found in the PON.
Something that has a high allele fraction is almost certainly a germline site. A low allele fraction site is usually an error and in a tumor sample it would also be an error. You would need very high quality sequencing data and/or extra data to be confident that a site is mosaicism and not an error.
The only way around this would be to run Mutect2 with --genotype-germline-sites and then use SelectVariants for manual filtering with your own thresholds. Some researchers do study mosaicism by running Mutect2 with two normal samples, for example a brain normal as the tumor sample and a blood normal as the normal sample.
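As a rough idea of what such manual filtering could look like outside SelectVariants, here is a minimal Python sketch that flags records with a low but non-zero allele fraction in the normal and a clearly higher one in the tumor. Everything specific in it is an assumption: the thresholds (0.05 and 0.10) are arbitrary placeholders, the normal is assumed to be the first sample column and the tumor the second (check the #CHROM header line of your VCF), and the fraction is computed from the FORMAT/AD read counts.

```python
def vaf_from_ad(ad_field):
    """Return alt/(ref+alt) from an AD value such as '97,3'; None if depth is 0."""
    depths = [int(x) for x in ad_field.split(",")]
    total = sum(depths)
    if total == 0:
        return None
    # All ALT alleles are summed, so multi-allelic records are treated coarsely.
    return sum(depths[1:]) / total

def candidate_mosaic(line, normal_col=9, tumor_col=10,
                     max_normal_vaf=0.05, min_tumor_vaf=0.10):
    """True if the normal has a low, non-zero VAF and the tumor a higher one.

    Column indices and thresholds are illustrative assumptions, not GATK defaults.
    """
    fields = line.rstrip("\n").split("\t")
    ad_idx = fields[8].split(":").index("AD")  # position of AD in FORMAT
    normal_vaf = vaf_from_ad(fields[normal_col].split(":")[ad_idx])
    tumor_vaf = vaf_from_ad(fields[tumor_col].split(":")[ad_idx])
    if normal_vaf is None or tumor_vaf is None:
        return False
    return 0 < normal_vaf <= max_normal_vaf and tumor_vaf >= min_tumor_vaf

# Hypothetical records for illustration only (normal first, tumor second).
low_in_normal = "chr1\t100\t.\tA\tT\t.\tgermline\t.\tGT:AD\t0/0:97,3\t0/1:60,40"
het_in_normal = "chr1\t200\t.\tG\tC\t.\tgermline\t.\tGT:AD\t0/1:50,50\t0/1:45,55"
absent_in_normal = "chr1\t300\t.\tC\tT\t.\tPASS\t.\tGT:AD\t0/0:100,0\t0/1:70,30"

for record in (low_in_normal, het_in_normal, absent_in_normal):
    print(record.split("\t")[1], candidate_mosaic(record))
```

A handful of ALT reads in the normal can just as easily be sequencing error, so any records selected this way would still need independent support.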
-
1) Does turning on --genotype-germline-sites or --genotype-pon-sites change anything about the downstream filtering process (e.g. FilterMutectCalls and FilterAlignmentArtifacts)? I just want to make sure it doesn't change how those work.
Based on the strict definition of the above options, it should not affect anything in terms of the final genotype calls and filtering process, and should simply save genotyping info in the final VCF.
2) Also, I tried --genotype-pon-sites, but it is impossibly slow due to the large number of sites. Is this normal? Is there any way to have it only genotype PON sites that have a chance of having a variant (i.e. at least 1 ALT read)? I have a feeling it is genotyping every PON site, regardless of whether there are any ALT reads, but this is a waste of time.
Thanks.
-
G E, turning on --genotype-germline-sites should not affect downstream filtering, but unfortunately I neglected to exclude germline sites from the clustered events filter, so turning on this argument, and thereby emitting more germline variants, makes the clustered events filter slightly overactive. Therefore I can only recommend this argument for diagnostic purposes, even though the effect is small.
Turning on --genotype-pon-sites provides some of the filters that use machine learning with slightly more data. This is theoretically beneficial but usually negligible.
--genotype-pon-sites does only genotype PON sites that have decent evidence of somatic variation, but even that is very slow. In most cases the slowdown is a factor of 2, but it depends on the PON and the cleanliness of the data.
-
Hi, I understand that this is an old post but would like to follow up with a brief clarification question wrt G E's original question 3). It would be great if the dev team could provide further replies here, thanks in advance!
So, the posts above have clarified that the PON variants are not included in the assembly graph and not used in active region determination, and thus mostly will not appear at all in the Mutect2 callset (rather than given a filter label afterwards), and that --genotype-pon-sites affects this behavior.
Although it seems that the --genotype-germline-sites arguments work in a similar manner, to me the posts above have not explicitly confirmed the details:
1. The documentation for this argument says "call all apparent germline site even though they will ultimately be filtered" -- what exactly is an "apparent" germline site? Does it mean any site in the paired normal sample that is different from the reference? Or is there a specific criterion?
2. Is it true that these "apparent" germline sites will not be used in active region detection and the assembly graph by default, but setting --genotype-germline-sites to true will reverse the default behavior (just like --genotype-pon-sites for the PON)? Or does --genotype-germline-sites work in a different way?
-
1. The details may be found in the "Finding Active Regions" section of our documentation here: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf. Basically, there is a principled but fast model for deciding whether a site has sufficient evidence in the tumor sample but not the normal sample.
2. By default an apparent germline site or PON site does not trigger assembly and genotyping. However, such sites may be near to a possible somatic site that does trigger assembly, in which case they will appear in the assembly graph, get genotyped, and most likely eventually get filtered. If you turn on --genotype-germline-sites all apparent germline sites trigger assembly and genotyping, analogous to --genotype-pon-sites.
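The behavior described in point 2 can be paraphrased as a toy model. This is only an illustration of the reply above, not GATK source code: the `Site` class, its fields, and both functions are hypothetical, and how Mutect2 actually decides that a site is "apparent germline" is left out entirely.

```python
from dataclasses import dataclass

@dataclass
class Site:
    tumor_evidence: bool             # some evidence for a variant in the tumor
    apparent_germline: bool = False  # criterion not modeled here
    in_pon: bool = False             # site is present in the panel of normals

def triggers_assembly(site, genotype_germline_sites=False, genotype_pon_sites=False):
    """Toy rule: does this site, on its own, trigger assembly and genotyping?"""
    if not site.tumor_evidence:
        return False
    if site.apparent_germline and not genotype_germline_sites:
        return False
    if site.in_pon and not genotype_pon_sites:
        return False
    return True

def genotyped_sites(region_sites, **flags):
    """Toy rule: if any site triggers assembly, every site in the region is genotyped."""
    if any(triggers_assembly(s, **flags) for s in region_sites):
        return list(region_sites)
    return []

germline = Site(tumor_evidence=True, apparent_germline=True)
somatic = Site(tumor_evidence=True)

print(genotyped_sites([germline]))                                # nothing triggers
print(genotyped_sites([germline, somatic]))                       # germline site rides along
print(genotyped_sites([germline], genotype_germline_sites=True))  # flag makes it trigger
```

In this toy model, a germline site that is genotyped only because of a nearby somatic candidate would still be expected to pick up the germline filter later, as described above.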