Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Germline sites in Mutect2

  • Bhanu Gandham

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

     

  • Genevieve Brandt (she/her)

    Hi G E,

    The Germline filter is described in depth in the Mutect2 paper, which can be found here: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf

    The section you are looking for is 2) Filtering E) Germline Filter on page 7.

    For your question 3), I don't see a bug there; the two arguments have different descriptions. --genotype-germline-sites is experimental and calls all apparent germline sites even though they will ultimately be filtered, whereas --genotype-pon-sites calls sites in the PoN even though they will ultimately be filtered.

    Best,

    Genevieve
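
To make the difference between the two arguments concrete, here is a minimal Python sketch that only assembles a Mutect2 command line with either option switched on. It does not run GATK. The helper name `build_mutect2_command` and all file and sample names are hypothetical, chosen for illustration; the flags themselves are the ones named in this thread plus the usual Mutect2 input and output arguments.

```python
from typing import List, Optional


def build_mutect2_command(
    reference: str,
    tumor_bam: str,
    normal_bam: str,
    normal_sample: str,
    output_vcf: str,
    panel_of_normals: Optional[str] = None,
    germline_resource: Optional[str] = None,
    genotype_germline_sites: bool = False,
    genotype_pon_sites: bool = False,
) -> List[str]:
    """Return an argv list for `gatk Mutect2`; nothing is executed here."""
    cmd = [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "-I", normal_bam,
        "-normal", normal_sample,
        "-O", output_vcf,
    ]
    if panel_of_normals:
        cmd += ["--panel-of-normals", panel_of_normals]
    if germline_resource:
        cmd += ["--germline-resource", germline_resource]
    # Both options below only ask Mutect2 to emit calls that are
    # expected to be filtered later; they are off by default.
    if genotype_germline_sites:
        cmd += ["--genotype-germline-sites", "true"]
    if genotype_pon_sites:
        cmd += ["--genotype-pon-sites", "true"]
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(" ".join(build_mutect2_command(
        "ref.fasta", "tumor.bam", "normal.bam", "NORMAL", "out.vcf.gz",
        panel_of_normals="pon.vcf.gz",
        genotype_germline_sites=True,
    )))
```

Returning a list rather than a shell string keeps quoting out of the picture if the command is later handed to `subprocess.run`.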

  • Genevieve Brandt (she/her)

    Were any of your questions not addressed in that paper, G E?

  • G E

    Not completely for question (2). There seem to be two steps where Mutect2 can filter germline variants: once while Mutect2 itself is running, in which case those variants never show up in the final VCF at all, and a second time, in which case those variants show up labeled as 'germline' in the FILTER column of the final VCF. The difference between these steps, and the algorithm behind each, is not clear to me, and I don't think either is described explicitly in the paper.

  • Genevieve Brandt (she/her)

    Question 2 - Mutect2 tries to get rid of germline sites before running the assembly graph algorithm in order to save computational time. The sites that are left and are labeled as germline in the FILTER column are either near a tumor site or are unclear before reassembly.

    Question 4 - You can set any sample where you want to find somatic variants as the tumor sample and another tissue as the normal sample. You definitely need a normal sample, however, or you will not be able to get good results.

  • G E

    Thank you! Regarding Question 2: what are the criteria used for filtering germline sites before running the assembly graph? The algorithm for that is not clear from the Mutect2 PDF documentation.

  • Genevieve Brandt (she/her)

    Here is one of the Mutect2 devs responding to a very similar question: https://gatk.broadinstitute.org/hc/en-us/community/posts/360075855412/comments/360014567952

    The PON is used for this filtering. There is also more information under question 15 of the Mutect FAQ.

  • G E

    Is there any way to set a threshold VAF for germline and PON sites in the normal sample below which they are genotyped by Mutect2?

    This is an important issue, because true low-level mosaic variants in normal samples will cause those sites to be systematically filtered out in the tumor by the current approach, even though these variants are not germline variants.

    So for example, an early-occurring developmental driver mutation may be in the normal sample at low VAF, and then filtered by either the PON or germline filter in the final tumor VCF calls. However, this is a true mosaic variant in the normal sample that should still show up in the final VCF.

  • Genevieve Brandt (she/her)

    There is currently no way to change the germline and PON filtering thresholds. The PON filtering is a hard filter. We haven't found that a somatic driver site gets filtered out by the PON, because such a site wouldn't be found in the PON.

    Something that has a high allele fraction is almost certainly a germline site. A low allele fraction site is usually an error and in a tumor sample it would also be an error. You would need very high quality sequencing data and/or extra data to be confident that a site is mosaicism and not an error.

    The only way around this would be to run Mutect2 with --genotype-germline-sites and then use SelectVariants for manual filtering with your own thresholds. Some researchers do study mosaicism by running Mutect2 with two normal samples, for example a brain normal as the tumor and a blood normal as the normal sample.
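
As an illustration of what "manual filtering with your own thresholds" could look like, here is a minimal Python sketch that keeps only records whose ALT read fraction in the normal sample is at or below a user-chosen cutoff. It is a stand-in for the idea, not SelectVariants itself. It assumes an uncompressed, tab-delimited VCF whose FORMAT column includes the standard `AD` (allelic depths) field; the function names, sample name, and threshold are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def normal_vaf(fields: List[str], normal_index: int) -> Optional[float]:
    """ALT / (REF + ALT) read fraction for one sample, taken from its AD field."""
    format_keys = fields[8].split(":")
    sample_values = fields[9 + normal_index].split(":")
    if "AD" not in format_keys:
        return None
    ad_idx = format_keys.index("AD")
    if ad_idx >= len(sample_values):
        return None
    try:
        depths = [int(x) for x in sample_values[ad_idx].split(",")]
    except ValueError:  # e.g. a missing value written as "."
        return None
    total = sum(depths)
    if total == 0:
        return None
    # depths[0] is the REF count; everything after it is ALT.
    return sum(depths[1:]) / total


def keep_low_normal_vaf(
    lines: Iterable[str], normal_sample: str, max_normal_vaf: float
) -> Iterator[str]:
    """Yield header lines plus records whose normal-sample VAF <= max_normal_vaf."""
    normal_index = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            samples = line.split("\t")[9:]
            normal_index = samples.index(normal_sample)
            yield line
        elif line:
            if normal_index is None:
                raise ValueError("record seen before the #CHROM header line")
            vaf = normal_vaf(line.split("\t"), normal_index)
            # Records with no usable AD value are dropped (a design choice).
            if vaf is not None and vaf <= max_normal_vaf:
                yield line
```

Whether a low normal-sample fraction reflects true mosaicism or sequencing error is a separate question, as the reply above points out; the cutoff only controls which records are retained for inspection.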

  • G E

    1) Does turning on --genotype-germline-sites or --genotype-pon-sites change anything about the downstream filtering process (e.g. FilterMutectCalls and FilterAlignmentArtifacts)? I just want to make sure it doesn't change how those work.

    Based on the strict definition of the above options, it should not affect anything in terms of the final genotype calls and filtering process, and should simply save genotyping info in the final VCF.

    2) Also, I tried --genotype-pon-sites, but it is impossibly slow due to the large number of sites. Is this normal? Is there any way to have it only genotype PON sites that have a chance of having a variant (i.e. at least 1 ALT read)? I have a feeling it is genotyping every PON site, regardless of whether there are any ALT reads, but this is a waste of time.

    Thanks.

  • David Benjamin

    G E, turning on --genotype-germline-sites should not affect downstream filtering, but unfortunately I neglected to exclude germline sites from the clustered events filter, so turning on this argument, and thereby emitting more germline variants, makes the clustered events filter slightly overactive. Therefore I can only recommend this argument for diagnostic purposes, even though the effect is small.

    Turning on --genotype-pon-sites provides some of the filters that use machine learning with slightly more data. This is theoretically beneficial but usually negligible.

    --genotype-pon-sites does only genotype PON sites that have decent evidence of somatic variation, but even that is very slow. In most cases the slowdown is a factor of 2, but it depends on the PON and the cleanliness of the data.

  • Kenneth

    Hi, I understand that this is an old post, but I would like to follow up with a brief clarification question regarding G E's original question 3). It would be great if the dev team could provide further replies here; thanks in advance!

    So, the posts above have clarified that the PON variants are not included in the assembly graph and not used in active region determination, and thus mostly will not appear at all in the Mutect2 callset (rather than given a filter label afterwards), and that --genotype-pon-sites affects this behavior.

    Although it seems that the --genotype-germline-sites argument works in a similar manner, to me the posts above have not explicitly confirmed the details:

    1. The documentation for this argument says "call all apparent germline site even though they will ultimately be filtered" -- what exactly is an "apparent" germline site? Does it mean any site in the paired normal sample that is different from the reference? Or is there a specific criterion?

    2. Is it true that these "apparent" germline sites will not be used in active region detection and the assembly graph by default, but setting --genotype-germline-sites to true will reverse the default behavior (just like --genotype-pon-sites for the PON)? Or does --genotype-germline-sites work in a different way?

  • David Benjamin

    1. The details may be found in the "Finding Active Regions" section of our documentation here: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf. Basically, there is a principled but fast model for deciding whether a site has sufficient evidence in the tumor sample but not the normal sample.

    2. By default an apparent germline site or PON site does not trigger assembly and genotyping. However, such sites may be near a possible somatic site that does trigger assembly, in which case they will appear in the assembly graph, get genotyped, and most likely eventually get filtered. If you turn on --genotype-germline-sites, all apparent germline sites trigger assembly and genotyping, analogous to --genotype-pon-sites.
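
The behavior described in point 2 can be paraphrased as a small toy model in Python. This is only a restatement of the reply above as boolean logic, not GATK's implementation: the `Site` class, its fields, and both function names are hypothetical, and the handling of a site that is both in the PON and apparently germline is an assumption the thread does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    apparent_germline: bool = False     # as judged by the active-region model
    in_pon: bool = False                # present in the panel of normals
    near_triggering_site: bool = False  # close to a site that triggers assembly


def triggers_assembly(
    site: Site,
    genotype_germline_sites: bool = False,
    genotype_pon_sites: bool = False,
) -> bool:
    """Does this candidate site, on its own, trigger assembly and genotyping?"""
    if site.apparent_germline and not genotype_germline_sites:
        return False
    # Assumption: a PON site still needs its own flag even if the
    # germline flag is on; the thread does not address this case.
    if site.in_pon and not genotype_pon_sites:
        return False
    return True


def gets_genotyped(
    site: Site,
    genotype_germline_sites: bool = False,
    genotype_pon_sites: bool = False,
) -> bool:
    """A site is genotyped if it triggers assembly or sits near a site that does."""
    return site.near_triggering_site or triggers_assembly(
        site, genotype_germline_sites, genotype_pon_sites
    )
```

In this model, a germline site next to a somatic candidate is genotyped even with both flags off, which matches the case where a record appears in the final VCF carrying the 'germline' filter.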
