Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Mutect2 - somatic variant calling with/without matched normal sample

14 comments

  • Bhanu Gandham

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

     

     

  • David Benjamin

    How was the GDC PoN generated and how many samples went into it?

  • D B

    Hi David,

    Sorry to keep bothering you with all my questions.

    From my previous discussion with people from GDC, they used 4,000+ blood normal samples to create the PoN with GATK4 (v4.0.4.0). Some information on their current pipeline can be found here: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/#tumor-only-variant-calling-workflow

    In my case, I'm focusing only on breast cancer, so I wonder whether I should include only female samples in the PoN.

  • D B

    In case anyone was wondering whether this is the case for other samples as well, I have tried a couple of other matched samples and saw similar numbers (10-15% overlap between the tumor-only and matched normal/tumor workflows).
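
For readers who want to reproduce an overlap figure like the one above, here is a minimal sketch of the arithmetic. It assumes a variant is identified by (chrom, pos, ref, alt) and that the denominator is the tumor-only call set; the post does not say which definition was used, and the call sets below are made-up toy values.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def overlap_fraction(tumor_only: Iterable[Variant], matched: Iterable[Variant]) -> float:
    """Fraction of tumor-only calls that also appear in the matched-normal call set."""
    tumor_only_set: Set[Variant] = set(tumor_only)
    matched_set: Set[Variant] = set(matched)
    if not tumor_only_set:
        return 0.0
    return len(tumor_only_set & matched_set) / len(tumor_only_set)


# Hypothetical toy call sets, not real data.
tumor_only_calls = [
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 500, "C", "T"),
    ("chr3", 900, "T", "G"),
]
matched_calls = [("chr1", 100, "A", "T"), ("chr7", 42, "G", "A")]

# One of the four tumor-only calls is shared, so this prints 25%.
print(f"{overlap_fraction(tumor_only_calls, matched_calls):.0%}")
```

In practice the two sets would be read from the filtered VCFs, and it is worth normalizing indel representation first so that identical variants compare equal.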

    I also confirmed, using HaplotypeCaller, that the variants unique to the tumor-only workflow are not germline variants. Below is the command I used for germline calling (no filtering was applied, so that any potential germline variant would be kept):

    gatk HaplotypeCaller -R hg38.fa -I normal.bam -O normal.vcf.gz

  • David Benjamin

    To be honest, the best you can ever hope to achieve with tumor-only calling is a set of candidate variants, most of which are actually germline variants.  Even if you were extremely conservative and removed every allele in gnomAD regardless of frequency, that would still leave several tens of thousands of unique germline variants.

    In the case of low-VAF subclones and impure samples with lots of normal DNA mixed in, where the allele fractions of the variants you want differ significantly from the diploid het fraction of 1/2, FilterMutectCalls can do better.  In general, however, a small overlap between tumor-only and matched normal calling is inevitable.

    Rare germline variants are one source of difference.  They don't necessarily lead to tumor-only calls that get filtered out with the matched normal.  More often, the existence of rare germline variants forces FilterMutectCalls to be conservative and overfilter real somatic variants with an allele fraction anywhere near 1/2.
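
A toy calculation may help show why allele fractions near 1/2 are hard to rescue. The sketch below is not the FilterMutectCalls model; it is a two-hypothesis Bayesian comparison under simplifying assumptions: a diploid region, a heterozygous germline prior from Hardy-Weinberg, a somatic allele fraction that is uniform on [0, 1], and a made-up per-site somatic prior (`somatic_prior`). All numbers are illustrative.

```python
from math import comb


def germline_posterior(alt_count: int, depth: int, population_af: float,
                       somatic_prior: float = 1e-6) -> float:
    """Posterior probability of 'germline het' versus 'somatic' in a toy model."""
    # Prior probability of a heterozygous germline genotype (Hardy-Weinberg).
    germline_prior = 2.0 * population_af * (1.0 - population_af)
    # Germline het in a diploid region: alt reads ~ Binomial(depth, 1/2).
    germline_likelihood = comb(depth, alt_count) * 0.5 ** depth
    # Somatic with a uniform allele fraction: after integrating the fraction
    # out, every alt count from 0 to depth is equally likely.
    somatic_likelihood = 1.0 / (depth + 1)
    germline = germline_prior * germline_likelihood
    somatic = somatic_prior * somatic_likelihood
    return germline / (germline + somatic)


# An allele absent from the germline resource, given a very small assumed frequency.
rare_af = 5e-6
near_half = germline_posterior(alt_count=50, depth=100, population_af=rare_af)
low_vaf = germline_posterior(alt_count=10, depth=100, population_af=rare_af)
print(f"VAF 0.50: P(germline) = {near_half:.3f}")
print(f"VAF 0.10: P(germline) = {low_vaf:.3g}")
```

Under these assumptions, an allele that was never seen in the population still looks far more likely to be germline than somatic when its fraction sits at 1/2, while the same allele at a fraction of 0.1 does not. This is the sense in which low-VAF subclones and impure samples are easier cases.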

    The other source of difference, which always leads to tumor-only calls that don't exist in the normal (and don't show up in the output of HaplotypeCaller), is mapping artifacts that can be detected from the matched normal.  The basic idea is that different genomes have different SVs and other variation that affects mapping error.  For example, one SNP in a centromere reference gap may tip the scale in favor of a mapping error elsewhere in the genome.  To the extent that the variation causing this is common, a PoN can and does help with this, but there is enough rare variation that this does not suffice.

    It occurs to me, though we have never tried this, that using a paternal and maternal sample as two matched normals (M2 lets you do this by specifying -I for the tumor and both normals and -normal for both normals) might help a lot.  Of course, if you don't have a matched normal you probably don't have the parents.
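
As a sketch of what such an invocation might look like, the snippet below assembles a Mutect2 command line with one tumor and two normals. The file names and the sample names `father` and `mother` are hypothetical; the value passed to `-normal` is assumed to match the SM tag in each normal BAM's read group. The command is only printed, not run.

```python
import shlex
from typing import List, Sequence


def mutect2_two_normals_cmd(reference: str, tumor_bam: str,
                            normal_bams: Sequence[str],
                            normal_sample_names: Sequence[str],
                            output_vcf: str) -> List[str]:
    """Build a Mutect2 argument list with one tumor and several normals."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    # Every normal BAM is passed as an additional -I input.
    for bam in normal_bams:
        cmd += ["-I", bam]
    # Every normal sample is named with its own -normal argument.
    for name in normal_sample_names:
        cmd += ["-normal", name]
    cmd += ["-O", output_vcf]
    return cmd


cmd = mutect2_two_normals_cmd(
    reference="hg38.fa",
    tumor_bam="tumor.bam",
    normal_bams=["father.bam", "mother.bam"],      # hypothetical file names
    normal_sample_names=["father", "mother"],      # hypothetical SM tags
    output_vcf="somatic.vcf.gz",
)
print(shlex.join(cmd))
```

A real run would normally also supply a germline resource and a panel of normals, which are omitted here to keep the sketch short.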

    That PoN sounds fine, and I see no reason to exclude males.

  • vctrymao

    Hi David and D,

    I was interested in looking at the same thing. 

    D, you say that you do not see germline mutations in the variants unique to Mutect2 tumor-only calling? I wonder, how did you run HaplotypeCaller? To my understanding, it is difficult to capture rare/unique germline events (singletons, I think they're called?) with HaplotypeCaller, as its follow-up step, GenotypeGVCFs, uses germline mutations seen in multiple samples to boost confidence.

    David, you say that "the existence of rare germline variants forces FilterMutectCalls to be conservative and overfilter real somatic variants with an allele fraction anywhere near 1/2." How does Mutect2 detect these rare germline variants in the first place to know to overfilter?

    Do you mind also elaborating on the mapping error? I'm not quite sure I understand what you were saying, and how a PoN fits in. 

     

    Thanks

  • David Benjamin

    Mutect2 detects rare germline variants the same way it detects any other variant.  The point is that they are so rare as to be absent even from gnomAD, so there is no prior knowledge suggesting that they are germline.

    Mapping error is when reads from one part of the genome are aligned to another part of the genome (this can happen due to incompleteness of the reference, structural variation, and homology).  Since they are real DNA sequence, you can't detect them the same way you detect errors from sequencing and sample preparation.  There are some signatures we can look for, but a panel of normals is also very helpful because these errors tend to occur in the same places from person to person.

  • vctrymao

    I see. I thought gnomAD was only used in FilterMutectCalls for the germline filter as a prior? I also thought that there were ways to estimate priors if the candidate variant was not found in the population database? 

    You also said that "they don't necessarily lead to tumor-only calls that get filtered out with the matched normal". I'm a bit confused; are you saying that these rare germline events can't be filtered out with a matched normal? Are you saying that, generally speaking, Mutect2 needs a population database prior even with a matched normal to filter out germline events? 

    It would be very helpful if you could elaborate which parameters in which statistical model in Mutect2 + filters are affected, as I am trying to understand the methodology as well.

     

  • D B

    Hey vctrymao,

    Aside from the HaplotypeCaller command that I mentioned in one of my comments, nothing else was run. As you said, it is generally advised (GATK best practices) to run multiple samples together as part of a germline variant calling pipeline. However, to address the question I had at the time, I decided to run HaplotypeCaller individually on a handful of samples, without any filtering, to keep all variants.

    For your question about using a population resource to filter out germline events, I think the section 'A variant allele in the case sample is not called if the site is variant in controls' (towards the bottom) of the page linked below will help:  https://gatk.broadinstitute.org/hc/en-us/articles/360035890491-Somatic-calling-is-NOT-simply-a-difference-between-two-callsets#:~:text=HaplotypeCaller%20is%20designed%20to%20call,designed%20to%20call%20somatic%20variants.

    If you decide to run any sort of tests relevant to this post, please do update!

  • David Benjamin

    vctrymao

    You are correct that the allele frequency from the germline resource is used as a prior.

    If a variant is not in the germline resource, we assign a default allele frequency that is somewhat rarer than 1/(size of germline resource).  That is, if your germline resource of 100,000 diploid samples doesn't have an allele, we can guess that the frequency is less than 1 in 200,000.
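
The arithmetic behind that bound can be sketched as follows. The function name is made up for illustration, and the sketch only computes the 1/(number of chromosomes) bound from the example; the actual default that the tool assigns is described above only as "somewhat rarer" than this.

```python
def unseen_allele_frequency_bound(num_diploid_samples: int) -> float:
    """Rough upper bound on the frequency of an allele absent from a resource.

    Each diploid sample contributes two chromosomes, so an allele never seen
    in N samples was missed on 2 * N chromosomes.
    """
    if num_diploid_samples <= 0:
        raise ValueError("num_diploid_samples must be positive")
    return 1.0 / (2 * num_diploid_samples)


bound = unseen_allele_frequency_bound(100_000)
print(bound)  # 5e-06, i.e. 1 in 200,000
```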

    By "they don't necessarily lead to tumor-only calls that get filtered out with the matched normal" I meant that rare germline variants sometimes get filtered even in tumor-only mode.

    Mutect2 should always be run with a germline resource, even in matched normal mode, although it is designed to run as well as possible without a germline resource.

     

  • vctrymao

    Thank you. I guess I am still confused about a few things. 

    1. If rare germline variants sometimes get filtered even in tumor-only mode, that's good, right? So the problem is with the rare germline variants that are still not filtered out? What characterizes these from germline mutations that do get filtered out?

    2. Are you saying that most germline mutations will be captured in something like gnomAD, so there will be a prior for the large majority of candidate germline mutations?

    3. You also say that "the existence of rare germline variants forces FilterMutectCalls to be conservative and overfilter real somatic variants with an allele fraction anywhere near 1/2." I still don't understand how this is working. What aspect of rare germline variants forces FilterMutectCalls to be conservative? It seems to me the only difference between a rare and common germline variant is the population frequency prior. But since somatic mutations also have no population prior, are you saying that because of this, Mutect2 calls everything with a VAF of 1/2 as germline?

    4. What are the calls that exist in the matched-normal calls but not the tumor-only calls? In what situations would a matched-normal help pick up a somatic variant that a tumor-only caller could not see?

    5. Can HaplotypeCaller pick up on rare germline variants? I was wondering if you could counteract my point in 3) by allowing somatic variants with a VAF near 1/2 to pass the filters, and then filtering out all the remaining (and rare) germline variants via HaplotypeCaller.

     

  • D B

    Hey vctrymao,

    With regard to #4, David Benjamin has covered potential scenarios in one of the previous comments:

    https://gatk.broadinstitute.org/hc/en-us/community/posts/360057810051/comments/360009638892

  • David Benjamin

    1. If rare germline variants sometimes get filtered even in tumor-only mode, that's good, right? So the problem is with the rare germline variants that are still not filtered out? 

    Yes and yes.

    What characterizes these from germline mutations that do get filtered out?

    It all depends on how well the allele fraction fits the spectrum determined by the somatic clustering model versus the germline allele frequencies given by the local copy number (if using the -tumor-segmentation input from CalculateContamination; otherwise copy number is assumed to be 2 everywhere).

    2. Are you saying that most germline mutations will be captured in something like gnomAD, so there will be a prior for the large majority of candidate germline mutations?

    Yes, but the problem is not that rare germline variants are a large fraction of germline variants.  Rather, it's that rare germline variants are more common than somatic variants.

    3. You also say that "the existence of rare germline variants forces FilterMutectCalls to be conservative and overfilter real somatic variants with an allele fraction anywhere near 1/2." I still don't understand how this is working. What aspect of rare germline variants forces FilterMutectCalls to be conservative?

    It seems to me the only difference between a rare and common germline variant is the population frequency prior. But since somatic mutations also have no population prior, are you saying that because of this, Mutect2 calls everything with a VAF of 1/2 as germline?

    See the answer to #1.

    4. What are the calls that exist in the matched-normal calls but not the tumor-only calls? In what situations would a matched-normal help pick up a somatic variant that a tumor-only caller could not see?

    A matched normal can give very good evidence that a variant is definitely not a germline variant.

    5. Can HaplotypeCaller pick up on rare germline variants?

    Absolutely.

    I was wondering if you could counteract my point in 3) by allowing somatic variants of VAF near 1/2 to pass the filters, and then filter all the remaining (and rare) germline variants out via HaplotypeCaller.

    You could do this, but I don't see what it would accomplish.  HaplotypeCaller can't distinguish somatic variants with large allele fractions from germline variants.

  • ming hu

    Hi,

    Where can I download this GATK file, gatk4_mutect2_4136_pon.vcf.gz? Can you give me a link?

    Thanks
