Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

CalculateContamination results differ greatly between single and paired mode

Answered

14 comments

  • Official comment
    Genevieve Brandt (she/her)

    Hi linouhao,

    I have an update from our developers, who were able to take a look at your questions:

    I am wondering whether it can serve as a separate tool for contamination

    Yes, CalculateContamination can be a standalone tool without Mutect2.

    1. What is the cutoff for contamination?

    There is no specific cutoff for contamination. It depends on your own data, analysis, and goals.

    2. When I combine the FASTQs of two samples, the contamination value is lower than when calculated separately.

    You should not be combining multiple samples for CalculateContamination. Combining samples will break the underlying assumptions of the model and the output will not be reliable. If your fastqs are from the same sample, then the contamination should be consistent if you combine the fastqs.

    3. How should I interpret the following values: are they percentages or doubles, and what do they stand for?

    sample contamination error
    ZZ2 0.0024162826035807497 0.003194224673422314 

    The value is a double. The sample contamination error generally should be taken with a grain of salt because there are a lot of assumptions that can go wrong. If the number is small, you can trust the estimate. If the number is large, something wrong is occurring. The value you shared indicates that the contamination is small. 
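To make the interpretation concrete, here is a minimal Python sketch that reads a table shaped like the one quoted above and prints the values as percentages. It assumes the table is tab-separated with the three columns shown, and that the double is a fraction between 0 and 1 (so 0.0024 would read as roughly 0.24%). Both points are assumptions to check against your own output file, and the helper names are hypothetical.

```python
import csv
import io

# Values copied from the thread. The tab-separated layout is an assumption
# about how the contamination table is written.
TABLE_TEXT = (
    "sample\tcontamination\terror\n"
    "ZZ2\t0.0024162826035807497\t0.003194224673422314\n"
)


def read_contamination(text):
    """Return a list of (sample, contamination, error) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["sample"], float(row["contamination"]), float(row["error"]))
        for row in reader
    ]


def as_percent(fraction):
    # Assumption: the double is a fraction in [0, 1], so multiply by 100.
    return fraction * 100.0


records = read_contamination(TABLE_TEXT)
for sample, contamination, error in records:
    print(f"{sample}: {as_percent(contamination):.2f}% +/- {as_percent(error):.2f}%")
```

Note that for this sample the error is larger than the estimate itself, which fits the advice above to read it with a grain of salt.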

    I hope this helps out!

  • Genevieve Brandt (she/her)

    Hi linouhao,

    I am going to move your post into our Community Discussions -> General Discussion topic, as the Somatic topic is for reporting bugs and issues with GATK.

    You can read more about our forum guidelines and the topics here: Forum Guidelines.

    Best,

    Genevieve

  • linouhao

    Thanks a lot, Genevieve Brandt (she/her).

    I have two samples sequenced in one batch. They are tumor-only samples.

    One has a variant:

    EGFR:NM_005228.4:exon21:c.2573T>G:p.L858R  3695 2858 77.35%

    I use the files

    af-only-gnomad.raw.sites.b37.vcf.gz
    small_exac_common_3_b37.vcf.gz
    to calculate the contamination. The result is
    sample contamination error
    AB1 0.03449856983776939 0.046057571685247295


    #####################################
    The other sample:
    EGFR:NM_005228.4:exon21:c.2573T>G:p.L858R 1979 6 0.30%

    sample contamination error
    AB2 0.023914936098781914 0.025049172427224962


    Is there contamination between the two samples?
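The allele fractions quoted in this comment can be checked with a few lines of Python. This sketch assumes the three numbers after each variant are total depth, alt depth, and frequency, in that order (the ordering labeled in the next comment). The function name is hypothetical.

```python
def allele_fraction(total_depth, alt_depth):
    """Fraction of reads supporting the alternate allele."""
    return alt_depth / total_depth


# (total depth, alt depth) for EGFR p.L858R in each sample, from the thread.
calls = {"AB1": (3695, 2858), "AB2": (1979, 6)}

for sample, (total, alt) in calls.items():
    print(f"{sample}: {allele_fraction(total, alt):.2%}")
```

The results match the 77.35% and 0.30% reported above.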
  • linouhao

    This is a new batch, different from the one above.

    KRAS:NM_033360.3:exon2:c.34G>T:p.G12C

    1211 (total depth) 7 (alt depth) 0.58% (freq)

    sample contamination error
    ZZ1 0.09630143157338196 0.09423682033092913

    Do these 2 samples have contamination?

    The other is

    sample contamination error
    ZZ2 0.0024162826035807497 0.003194224673422314 

    KRAS:NM_033360.3:exon2:c.34G>T:p.G12C 

    1114 155 13.91%

  • linouhao

    The other important question is whether the contamination is a percentage or just a decimal.

    The source code shows a double, and I find no conversion to a percentage.

  • linouhao

    When I merge the FASTQs of two different samples (AB1 and AB2), the final result is

    sample contamination error
    AB1_AB2 0.025162044789905476 0.03591113387573044

    The value is lower than the individual sample contamination, which seems strange to me.

  • Genevieve Brandt (she/her)

    Hi linouhao,

    To start with your original post: if you are comparing tumor-only and matched-normal calculations, it makes sense that you get different values. Matched-normal analysis is much more reliable; if you have a matched normal, I would definitely recommend using it for your somatic analysis.
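As a rough illustration of the two modes, the Python sketch below only builds the command lines and prints them rather than running anything. The tool names and the `-I`, `-V`, `-L`, `-O`, and `-matched` arguments reflect my understanding of the GATK4 command line and should be verified against the documentation for your version. All file names and function names are hypothetical placeholders.

```python
import shlex


def pileup_cmd(bam, common_sites_vcf, out_table):
    """Command for summarizing pileups at common germline sites."""
    return [
        "gatk", "GetPileupSummaries",
        "-I", bam,
        "-V", common_sites_vcf,
        "-L", common_sites_vcf,
        "-O", out_table,
    ]


def contamination_cmd(tumor_pileups, out_table, normal_pileups=None):
    """Tumor-only by default; adds the matched normal when one is given."""
    cmd = ["gatk", "CalculateContamination", "-I", tumor_pileups, "-O", out_table]
    if normal_pileups is not None:
        cmd += ["-matched", normal_pileups]
    return cmd


# Hypothetical file names, for illustration only.
sites = "small_exac_common_3_b37.vcf.gz"
print(shlex.join(pileup_cmd("tumor.bam", sites, "tumor.pileups.table")))
print(shlex.join(contamination_cmd("tumor.pileups.table", "tumor_only.table")))
print(shlex.join(
    contamination_cmd("tumor.pileups.table", "paired.table", "normal.pileups.table")
))
```

The only difference between the two CalculateContamination calls is the extra matched-normal pileup table, which is why the two modes can report different values for the same tumor.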

    Please let me know if you have further questions.

    Best,

    Genevieve

  • linouhao

    Genevieve Brandt (she/her)

    Thanks a lot.

    Most of the time, we cannot get matched normals.

    I am eager to know the answers to my questions.

  • Genevieve Brandt (she/her)

    Can you clarify your other questions? 

  • linouhao

    Thanks a lot.

    1. What is the cutoff for contamination?

    2. When I combine the FASTQs of two samples, the contamination value is lower than when calculated separately.

  • Genevieve Brandt (she/her)
    1. There isn't a specific contamination hard filter. The contamination table is used as input to FilterMutectCalls, which uses a model for filtering. You can read more about filtering in the Mutect2 tutorial and paper.
    2. We do not recommend combining multiple samples for analysis. Mutect2 is intended to be run on one sample or multiple samples from the same individual in multi-sample mode.
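For readers who do go on to variant filtering, this Python sketch shows how the contamination table might be passed to FilterMutectCalls. It only assembles and prints the command. The `--contamination-table` argument is how I understand GATK4 to accept the table, so confirm it against your version's documentation. The file names and the function name are hypothetical.

```python
import shlex


def filter_cmd(reference, unfiltered_vcf, contamination_table, out_vcf):
    """Build a FilterMutectCalls command that consumes a contamination table."""
    return [
        "gatk", "FilterMutectCalls",
        "-R", reference,
        "-V", unfiltered_vcf,
        "--contamination-table", contamination_table,
        "-O", out_vcf,
    ]


# Hypothetical file names, for illustration only.
cmd = filter_cmd("ref.fasta", "unfiltered.vcf.gz", "contamination.table", "filtered.vcf.gz")
print(shlex.join(cmd))
```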
  • linouhao

    Genevieve Brandt (she/her)

    Thanks a lot

    My intention is to check contamination, not to call variants, so it has nothing to do with Mutect2.

    I am wondering whether it can serve as a separate tool for contamination, and I want the answers to these questions.

    1. What is the cutoff for contamination?

    2. When I combine the FASTQs of two samples, the contamination value is lower than when calculated separately.

    3. How should I interpret the following values: are they percentages or doubles, and what do they stand for?

    sample contamination error
    ZZ2 0.0024162826035807497 0.003194224673422314

  • linouhao

    Genevieve Brandt (she/her)

    Thanks a lot to you and the developers; the answer is helpful.

    I want to ask a minor follow-up question:

    "If the number is small, you can trust the estimate. If the number is large, something wrong is occurring. The value you shared indicates that the contamination is small."

    Although you have said there is no cutoff, you also said the value I shared indicates that the contamination is small. How do you assess whether it is small or big?

  • Genevieve Brandt (she/her)

    There is no specific cutoff we can give you. As I said, an appropriate cutoff depends on your data and your experiment.

    You can determine if the values are small or big by comparing the results of different samples to each other.
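One way to do such a comparison is sketched below in Python, using the four estimates posted earlier in this thread. Ranking the samples and expressing each as a multiple of the group median is an illustrative choice on my part, not a GATK recommendation. The samples also come from two different batches, so in practice you would probably compare within a batch.

```python
from statistics import median

# Contamination estimates quoted earlier in this thread (two different batches).
estimates = {
    "AB1": 0.03449856983776939,
    "AB2": 0.023914936098781914,
    "ZZ1": 0.09630143157338196,
    "ZZ2": 0.0024162826035807497,
}


def rank_by_contamination(values):
    """Sort samples from highest to lowest estimate."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_contamination(estimates)
group_median = median(estimates.values())

for sample, value in ranked:
    # The ratio to the median is an illustrative yardstick, not a GATK cutoff.
    print(f"{sample}\t{value:.4f}\t{value / group_median:.2f}x median")
```

Samples that sit well above the rest of the group are the ones worth a closer look.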
