What does contamination mean in GATK's somatic pipeline?
Hi GATK team,
I've been exploring GATK's somatic variant calling workflow and trying to understand how contamination is estimated using `GetPileupSummaries` and `CalculateContamination`. I've read the tool docs, but I'm still unclear about some of the core ideas. I'd appreciate clarification on the following points:
---
What exactly is GATK measuring when it estimates "contamination"?
- Does "contamination" refer to DNA from other individuals, to normal (non-tumor) cells from the same patient, or to both?
- How is contamination estimation conceptually different from tumor purity?
---
Why is gnomAD (or any germline population variant resource) required?
- What role does it play in contamination estimation?
---
How should I interpret the output of GetPileupSummaries?
| contig | position | ref_count | alt_count | other_alt_count | allele_frequency |
|--------|----------|-----------|-----------|------------------|------------------|
| chr6 | 29942512 | 9 | 0 | 0 | 0.063 |
| chr6 | 29942517 | 13 | 1 | 0 | 0.062 |
| chr6 | 29942525 | 13 | 7 | 0 | 0.063 |
| chr6 | 29942547 | 36 | 0 | 0 | 0.077 |
- I understand `alt_count` refers to reads supporting the ALT allele present in the germline resource (e.g., gnomAD), and `allele_frequency` is its population AF.
- But what exactly does `other_alt_count` mean?
- How do these values get used by `CalculateContamination` to infer contamination levels?
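
To make the question concrete, the only per-site quantity I can derive from these columns on my own is the fraction of reads supporting the ALT allele. Below is a minimal Python sketch of that reading, using the four rows from the table above. The helper name `alt_fraction` is mine, and including `other_alt_count` in the denominator is my assumption. This is not meant to represent what `CalculateContamination` does internally, which is exactly what I am asking about.

```python
# Rows copied from the GetPileupSummaries table above:
# (contig, position, ref_count, alt_count, other_alt_count, allele_frequency)
rows = [
    ("chr6", 29942512, 9, 0, 0, 0.063),
    ("chr6", 29942517, 13, 1, 0, 0.062),
    ("chr6", 29942525, 13, 7, 0, 0.063),
    ("chr6", 29942547, 36, 0, 0, 0.077),
]


def alt_fraction(ref_count, alt_count, other_alt_count):
    """Fraction of reads at a site that support the ALT allele.

    Assumption: reads counted in other_alt_count are part of the total depth.
    """
    depth = ref_count + alt_count + other_alt_count
    return alt_count / depth if depth > 0 else 0.0


# Map each position to its observed ALT read fraction.
fractions = {
    position: alt_fraction(ref, alt, other)
    for _contig, position, ref, alt, other, _af in rows
}

for position, fraction in fractions.items():
    print(f"{position}\t{fraction:.3f}")
```

Under this reading, the site at 29942517 has a low ALT fraction (1 of 14 reads) and the site at 29942525 has a much higher one (7 of 20 reads). I do not know which of these, if any, the tool treats as evidence of contamination.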
---
What do the default allele frequency thresholds (0.01 and 0.2) mean?
I noticed from the docs:
> "Note the default maximum population AF (--maximum-population-allele-frequency or -max-af) is set to 0.2... Likewise, the default minimum population AF (--minimum-population-allele-frequency or -min-af) is set to 0.01..."
- Why were 0.01 and 0.2 chosen as default values?
- How do they affect contamination estimation?
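
My current understanding is that these two options simply restrict which germline-resource sites are considered, based on their population AF. Here is a minimal Python sketch of that interpretation. The function name `in_af_window` is hypothetical, whether the bounds are inclusive is my assumption, and the last two AF values are made up for illustration rather than taken from my data.

```python
# Defaults quoted from the docs.
MIN_AF = 0.01  # --minimum-population-allele-frequency
MAX_AF = 0.2   # --maximum-population-allele-frequency


def in_af_window(allele_frequency, min_af=MIN_AF, max_af=MAX_AF):
    """Return True if a site's population AF falls inside the window.

    Assumption: both bounds are inclusive.
    """
    return min_af <= allele_frequency <= max_af


# First four values come from the allele_frequency column of my table;
# 0.005 and 0.45 are made-up examples outside the default window.
example_afs = [0.063, 0.062, 0.063, 0.077, 0.005, 0.45]
kept = [af for af in example_afs if in_af_window(af)]
print(kept)
```

If that interpretation is right, all four sites in my table pass the default window, so the thresholds matter only for sites that are rarer or more common in the population than these.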
---
I've searched through the forums and tool documentation but haven't found a clear breakdown of the statistical rationale and data interpretation here. I'd really appreciate any technical insights or references that explain how GATK uses this summary table and population priors to estimate contamination.
Thanks in advance!