Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Plausibly incorrectly-calculated 'contamination' for select variants on multiple samples



  • Genevieve Brandt (she/her)

    Hi Kresnodityo Jatiputro Widianto,

    Thank you for posting this question on the forum and for your patience while we found time to take a closer look. We (the support team) can definitely help with this question because we get many questions similar to this one. If we do not know the answer, we will escalate the issue to the developers.

    My first recommendation for understanding the contamination algorithm in Mutect2 is to take a look at the paper our developers wrote, which contains the most up-to-date information about how the filters work. Specifically, take a look at 2F for the contamination filter.

    It doesn't look like the contamination filter uses the normal sample within the FilterMutectCalls algorithm, but I do know that the ability to call high-confidence somatic variants decreases greatly without a normal sample. The lack of a normal sample could definitely cause some of the issues you are seeing, as Mutect2 relies heavily on the normal sample to filter out false positives.

    Once you get a chance to look at the paper, please let me know your follow-up questions and I can help further if need be.

    Best regards,


  • Anthony DiCi

    Thank you for your post, Kresnodityo Jatiputro Widianto! I want to let you know that we have received your question and will be moving it to the Community Discussions -> General Discussion topic, as the Somatic topic is for reporting bugs and issues with GATK.

    We'll get back to you if we have any updates or follow-up questions. Please see our Support Policy for more details about how we prioritize responding to questions.

  • Kresnodityo Jatiputro Widianto

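        # Summarize read support at the sites in the gnomAD VCF.
        gatk GetPileupSummaries \
            -I "${INDIR}/${sn}.bqsr.bam" \
            -V "${GNOMAD_GENOME}" \
            -L "${MANIFEST_BED}" \
            -ip 100 \
            -O "${OUT}/${sn}.pileups.table" &&
        # Estimate contamination from the pileup summaries.
        # Note: -I reads from ${INDIR}, while the table above is written to ${OUT}.
        gatk CalculateContamination \
            -I "${INDIR}/${sn}.pileups.table" \
            -O "${OUT}/${sn}.contamination.table" \
            --low-coverage-ratio-threshold 0.1 \
            --high-coverage-ratio-threshold 1 \
            -segments "${OUT}/${sn}.tumor_segments.table"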
    Above is an example of our attempt to change the contamination thresholds to see whether this changes the number of 'contamination' calls in our VCF. However, no matter what values we use for these thresholds, neither the number of 'contamination' calls nor the samples affected seem to change. Could the 'contamination' calls from CalculateContamination be affected by our using only this gnomAD VCF as a reference instead of a matched normal?
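One way to narrow this down is to check whether the contamination estimate itself moves when the thresholds change, before looking at the VCF. The sketch below prints the rows of one or more CalculateContamination output tables together. It assumes each table is tab-separated with one header line followed by sample, contamination, and error columns; the function name `show_contamination` and the demo values are made up for illustration.

```shell
#!/usr/bin/env bash
# Print "<table path> <sample> <contamination> <error>" for each data row
# of the given tables, skipping the header line of each table.
# Assumption: tab-separated columns in the order sample, contamination, error.
show_contamination() {
    for table in "$@"; do
        awk -F'\t' -v src="$table" \
            'NR > 1 { printf "%s\t%s\t%s\t%s\n", src, $1, $2, $3 }' "$table"
    done
}

# Demo with two made-up tables (illustrative values, not real results).
tmpdir="$(mktemp -d)"
printf 'sample\tcontamination\terror\n' >  "$tmpdir/run_a.contamination.table"
printf 'S1\t0.012\t0.001\n'             >> "$tmpdir/run_a.contamination.table"
printf 'sample\tcontamination\terror\n' >  "$tmpdir/run_b.contamination.table"
printf 'S1\t0.012\t0.001\n'             >> "$tmpdir/run_b.contamination.table"

show_contamination "$tmpdir/run_a.contamination.table" \
                   "$tmpdir/run_b.contamination.table"
rm -rf "$tmpdir"
```

If the estimates are identical across threshold settings, the contamination filter is unlikely to behave differently. If they differ, the VCF still only changes once FilterMutectCalls is rerun with the new table passed via --contamination-table.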
    Also, what factors can affect 'strand bias' and 'low allele freq' calls? The variants we call already have acceptance criteria around 5%, so is our lack of matched normals going to affect the number of 'strand bias' and 'low allele freq' calls? Are there other factors as well?
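To compare runs objectively, it can help to tally the FILTER column of each filtered VCF. The sketch below counts how many records carry each filter; a record with several filters is counted once under each. The function name `count_filters` and the demo records are illustrative, and the filter IDs shown are examples only; the ##FILTER lines in your own VCF header list the names your GATK version writes.

```shell
#!/usr/bin/env bash
# Read an uncompressed VCF on stdin and print "<filter><TAB><count>" lines.
# FILTER is column 7; multiple filters on one record are separated by ";".
count_filters() {
    awk -F'\t' '
        !/^#/ { n = split($7, f, ";"); for (i = 1; i <= n; i++) c[f[i]]++ }
        END   { for (k in c) printf "%s\t%d\n", k, c[k] }
    ' | LC_ALL=C sort
}

# Demo on a tiny made-up VCF (records are illustrative, not real calls).
demo_vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n'                              >  "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'     >> "$demo_vcf"
printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n'                    >> "$demo_vcf"
printf 'chr1\t200\t.\tG\tC\t.\tcontamination\t.\n'           >> "$demo_vcf"
printf 'chr1\t300\t.\tC\tT\t.\tcontamination;strand_bias\t.\n' >> "$demo_vcf"

count_filters < "$demo_vcf"
rm -f "$demo_vcf"
```

For a compressed file, something like `gzip -cd sample.filtered.vcf.gz | count_filters` should work, where the file name is a placeholder. Running this on the outputs of two threshold settings shows directly whether any filter count moved.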
    Thank you very much for your time,
    Kresnodityo Jatiputro Widianto
  • Genevieve Brandt (she/her)

    I'm wondering if you are using the wrong input files for GetPileupSummaries. The variant input (-V) to GetPileupSummaries is not supposed to be a matched normal; it is supposed to be a set of known sites. Take a look at this tutorial:

    The tool is supposed to be run with known variant sites, and we provide files of these sites in our data resources. Is your gnomAD file a known-sites file?
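The overall flow described above can be sketched as follows. This is a hedged outline, not the tutorial's exact commands: all paths and the `COMMON_SITES_VCF` variable are placeholders, the `run` helper is a hypothetical wrapper that only prints each command unless `DRY_RUN=0` is set, and using the same known-sites file for both -V and -L is one common pattern rather than a requirement. The last step matters for the original question: the FILTER column only reflects a new contamination estimate after FilterMutectCalls is rerun with the updated tables.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command by default; execute it only when
# DRY_RUN=0 (which requires gatk on PATH and real input files).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Placeholder inputs; replace every value with your own files.
sn="${sn:-sample1}"
TUMOR_BAM="${TUMOR_BAM:-${sn}.bqsr.bam}"
COMMON_SITES_VCF="${COMMON_SITES_VCF:-common_sites.vcf.gz}"   # known-sites resource
REFERENCE="${REFERENCE:-reference.fasta}"
UNFILTERED_VCF="${UNFILTERED_VCF:-${sn}.unfiltered.vcf.gz}"   # Mutect2 output

contamination_workflow() {
    # 1. Pileup summaries of the tumor BAM at known variant sites.
    run gatk GetPileupSummaries \
        -I "$TUMOR_BAM" -V "$COMMON_SITES_VCF" -L "$COMMON_SITES_VCF" \
        -O "${sn}.pileups.table"
    # 2. Contamination estimate plus tumor segmentation.
    run gatk CalculateContamination \
        -I "${sn}.pileups.table" \
        --tumor-segmentation "${sn}.segments.table" \
        -O "${sn}.contamination.table"
    # 3. Re-filter the Mutect2 calls using the tables from step 2.
    run gatk FilterMutectCalls \
        -R "$REFERENCE" -V "$UNFILTERED_VCF" \
        --contamination-table "${sn}.contamination.table" \
        --tumor-segmentation "${sn}.segments.table" \
        -O "${sn}.filtered.vcf.gz"
}

contamination_workflow
```

Keeping the dry run as the default makes it easy to review the exact commands and file names before committing compute time to them.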

    The strand bias filter is described in 2D and the allele fraction clustering model is described in 2C of the Mutect2 paper:

    I know there are a lot of discussions on this site about these algorithms as well, which may be helpful to you.

    Let me know if you have any further questions!

  • Genevieve Brandt (she/her)

    Hi Kresnodityo,

    We haven't heard from you in a while so we're going to close out this ticket in our system. If you still require assistance, simply respond to this thread and we'll be happy to pick up where we left off!

    Kind regards,

