Plausibly incorrectly-calculated 'contamination' for select variants on multiple samples
REQUIRED for all errors and issues:
a) GATK version used:
b) Exact command used:
gatk Mutect2 \
-R /home/ref/BroadInstitute/Homo_sapiens_assembly38.fasta \
-I ${INDIR}/${sn}.bqsr.bam \
-O ${OUT}/${sn}.unfiltered.vcf.gz \
-L "Myeloid.dna_manifest.20180509_hg38_horizon_region_short.bed"
gatk LearnReadOrientationModel \
-I ${INDIR}/${sn}.f1r2.tar.gz \
-O ${OUT}/${sn}.artifact_prior.tar.gz
gatk GetPileupSummaries \
-I ${INDIR}/${sn}.bqsr.bam \
-V "gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz" \
-L "Myeloid.dna_manifest.20180509_hg38_horizon_region_short.bed" \
-O ${OUT}/${sn}.pileups.table &&
gatk CalculateContamination \
-I ${INDIR}/${sn}.pileups.table \
-O ${OUT}/${sn}.contamination.table \
-segments ${OUT}/${sn}.tumor_segments.table
gatk FilterMutectCalls \
-R ${REF_FASTA} \
-V ${INDIR}/${sn}.unfiltered.vcf.gz \
--contamination-table ${INDIR}/${sn}.contamination.table \
--tumor-segmentation ${INDIR}/${sn}.tumor_segments.table \
-ob-priors ${OUT}/${sn}.artifact_prior.tar.gz \
-O ${OUT}/${sn}.filtered.vcf.gz
We are trying to detect SNP mutations in somatic tumor samples from targeted NGS using Mutect2, and we keep finding some abnormalities.
As background, a few of the variants we’re trying to detect in the validation dataset for targeted WGS keep being labelled ‘contamination’ with default Mutect2 settings and default CalculateContamination thresholds to increase contamination specificity. I would appreciate some solutions or clarifications regarding the process of labelling a called variant as ‘contamination’:
- Are there any references regarding the ‘relaxed’ algorithm Mutect2 uses to detect contamination, compared to Mutect?
- What QC statistics can be checked for false-positives of contamination?
- Does not having a ‘normal’ dataset affect ‘contamination’ call rates?
Would a tech-support style call also help with this case?
Thank you very much for your consideration,
Kresnodityo
-
Hi Kresnodityo Jatiputro Widianto,
Thank you for posting this question on the forum and for your patience while we found time to take a closer look. We (the support team) can definitely help with this question because we get many questions similar to this one! And if we do not know, we will escalate the issue and bring it to the developers.
My first recommendation in terms of understanding the contamination algorithm in Mutect2 would be to take a look at the paper our developers wrote: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf. This contains the most up to date information about how the filters work. Specifically take a look at 2F for the contamination filter.
It doesn't look like the contamination filter uses the normal sample within the FilterMutectCalls algorithm, but I do know that the ability to call high confidence somatic variants greatly decreases without a normal sample. No normal sample could definitely cause some of the issues you are seeing, as Mutect2 heavily relies on the normal sample to filter out false positives.
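As a rough way to see which filters are actually driving the failures, the FILTER column of the FilterMutectCalls output can be tallied. The sketch below is an illustration rather than part of the original reply: `count_filters` is a hypothetical helper name, and it assumes a standard VCF in which FILTER is column 7 and multiple filters are joined with `;`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally every filter name found in the FILTER column
# (column 7) of a VCF. Assumes multiple filters are separated by ';'.
count_filters() {
    local vcf="$1"
    # gzip -cdf decompresses .gz input and passes plain text through unchanged
    gzip -cdf "$vcf" \
        | awk -F'\t' '!/^#/ { n = split($7, f, ";"); for (i = 1; i <= n; i++) c[f[i]]++ }
                      END { for (k in c) printf "%s\t%d\n", k, c[k] }' \
        | LC_ALL=C sort
}
```

For example, `count_filters ${OUT}/${sn}.filtered.vcf.gz` would print one line per filter name (such as `contamination` or `strand_bias`) with the number of records carrying it, which makes it easier to see whether contamination is the dominant reason calls are being rejected.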
Once you get a chance to look at the paper, please let me know your follow up questions and I can then help further if need be.
Best regards,
Genevieve
-
Thank you for your post, Kresnodityo Jatiputro Widianto! I want to let you know we have received your question and will be moving it to the Community Discussions -> General Discussion topic, as the Somatic topic is for reporting bugs and issues with GATK.
We'll get back to you if we have any updates or follow up questions. Please see our Support Policy for more details about how we prioritize responding to questions.
-
GNOMAD_GENOME="${REF}/gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz"
gatk GetPileupSummaries \
-I ${INDIR}/${sn}.bqsr.bam \
-V ${GNOMAD_GENOME} \
-L ${MANIFEST_BED} \
-ip 100 \
-O ${OUT}/${sn}.pileups.table &&
gatk CalculateContamination \
-I ${INDIR}/${sn}.pileups.table \
-O ${OUT}/${sn}.contamination.table \
--low-coverage-ratio-threshold 0.1 \
--high-coverage-ratio-threshold 1 \
-segments ${OUT}/${sn}.tumor_segments.table
Above is an example of us attempting to change our contamination thresholds to see if this changes the number of 'contamination' calls in our VCF. However, no matter what values we put in these thresholds, there seem to be no changes to how many and which samples become 'contamination'. Are the 'contamination' calls from CalculateContamination affected by us only using this gnomAD VCF for reference instead of a matched normal?
Also, what factors can affect 'strand bias' and 'low allele freq' calls? The variants we call already have acceptance criteria around 5%, so is our lack of matched normals going to affect the number of 'strand bias' and 'low allele freq' calls? Are there other factors as well?
Thank you very much for your time,
Kresnodityo Jatiputro Widianto
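One way to check whether a threshold change had any effect upstream of the filter is to compare the contamination tables from the two runs directly. This is only a sketch: `show_contamination` is a hypothetical helper, the directory names in the usage note are made up, and it assumes the table is tab-separated with a single header row followed by sample, contamination and error columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the data rows of one or more CalculateContamination
# tables, each prefixed with the file it came from.
# Assumes a one-line header followed by tab-separated sample/contamination/error.
show_contamination() {
    local table
    for table in "$@"; do
        # NR > 1 skips the header line
        awk -F'\t' -v src="$table" \
            'NR > 1 { printf "%s\t%s\t%s\t%s\n", src, $1, $2, $3 }' "$table"
    done
}
```

Running something like `show_contamination run_default/${sn}.contamination.table run_relaxed/${sn}.contamination.table` puts the two estimates side by side. If the contamination value is the same in both rows, it would not be surprising for the set of variants labelled 'contamination' to stay the same too.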
-
I'm wondering if you are using the wrong input files to GetPileupSummaries. The input to GetPileupSummaries is not supposed to be a matched normal, it's supposed to be known sites. Take a look at this tutorial: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2
It is supposed to be run with known variant sites and we provide files of these sites in our data resources. Does the gnomAD file contain known sites?
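To sanity-check the resource, two quick things can be looked at: whether the VCF header declares an AF INFO field, which GetPileupSummaries relies on for population allele frequencies, and how many sites ended up in the pileup table for this panel. The helpers below are hypothetical names used for illustration. The second one assumes the pileup table starts with `#` metadata lines followed by a header row whose first column is `contig`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if the VCF header declares an AF INFO field.
# Stops reading at the first non-header line, so large files are not scanned fully.
has_af_info() {
    gzip -cdf "$1" \
        | awk '/^##INFO=<ID=AF,/ { found = 1 } !/^#/ { exit } END { exit !found }'
}

# Hypothetical helper: count data rows in a GetPileupSummaries table.
# Assumes '#' metadata lines and a header row whose first column is "contig".
count_pileup_sites() {
    awk '!/^#/ && $1 != "contig" { n++ } END { print n + 0 }' "$1"
}
```

For instance, `has_af_info "$GNOMAD_GENOME" && echo "AF present"` followed by `count_pileup_sites ${OUT}/${sn}.pileups.table`. A very small site count on a targeted panel would suggest the contamination estimate rests on little data, whatever thresholds are set.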
The strand bias filter is described in 2D and the allele fraction clustering model is described in 2C of the Mutect2 paper: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
I know there are a lot of discussions on this site about these algorithms as well, which may be helpful to you.
Let me know if you have any further questions!
-
Hi Kresnodityo,
We haven't heard from you in a while so we're going to close out this ticket in our system. If you still require assistance, simply respond to this thread and we'll be happy to pick up where we left off!
Kind regards,
Genevieve