Reducing false positives in somatic variant calling
I am using GATK version v4.2.6.1. I am doing somatic variant calling using Mutect2 and FilterMutectCalls using the default parameters.
I am getting a lot of calls in my output VCF flagged "weak_evidence" and/or "strand_bias". According to the Mutect2 FAQ, these calls are false positives. My "PASS" calls are only about 10% of the total, which is a very small proportion of the data.
I realize that somatic variant calling often produces many false positive calls, but I was hoping to find a way to reduce them in the final VCF. I have read multiple GATK forum posts and haven't found a clear answer, so I was hoping for some guidance from the team. I would appreciate your help!
-
Hi Fia
Unless you are doing matched tumor-normal calling, your variants will always contain more false positives. You can try to reduce them somewhat by setting the minimum AF according to your tumor purity; however, this is still a superficial fix compared to what an actual matched normal can provide.
Our team suggests using the PoN we created, the gnomAD AF-only resource, read orientation metrics, and, if possible, a matched normal for best results with Mutect2.
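As a rough sketch of that setup, the three steps could be wrapped as below. The function names, the GATK variable, and all file names are hypothetical placeholders; the flags are the standard Mutect2, LearnReadOrientationModel, and FilterMutectCalls options. Treat it as a starting point under those assumptions, not a validated pipeline.

```shell
# Override with GATK=echo to print the commands instead of running them.
GATK="${GATK:-gatk}"

# Step 1: call candidates with a PoN, a germline resource, and F1R2 counts.
# Arguments 6 and 7 (normal BAM and its sample name) are optional;
# leave them off for tumor-only mode.
call_somatic() {
  m2_ref=$1; m2_tumor=$2; m2_pon=$3; m2_gnomad=$4; m2_prefix=$5
  m2_normal=${6:-}; m2_normal_name=${7:-}
  set -- -R "$m2_ref" -I "$m2_tumor"
  if [ -n "$m2_normal" ]; then
    set -- "$@" -I "$m2_normal" -normal "$m2_normal_name"
  fi
  $GATK Mutect2 "$@" \
    --panel-of-normals "$m2_pon" \
    --germline-resource "$m2_gnomad" \
    --f1r2-tar-gz "${m2_prefix}.f1r2.tar.gz" \
    -O "${m2_prefix}.unfiltered.vcf.gz"
}

# Step 2: learn the read orientation model from the F1R2 counts.
learn_orientation() {
  $GATK LearnReadOrientationModel \
    -I "$1.f1r2.tar.gz" \
    -O "$1.read-orientation-model.tar.gz"
}

# Step 3: filter, passing the orientation priors to FilterMutectCalls.
filter_calls() {
  $GATK FilterMutectCalls \
    -R "$1" \
    -V "$2.unfiltered.vcf.gz" \
    --ob-priors "$2.read-orientation-model.tar.gz" \
    -O "$2.filtered.vcf.gz"
}
```

For example, `call_somatic ref.fa tumor.bam pon.vcf.gz gnomad.vcf.gz sample1 normal.bam B_111_1111`, followed by `learn_orientation sample1` and `filter_calls ref.fa sample1`.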
I hope this helps.
-
Thank you for the feedback. I have been doing a lot of reading to figure out how to filter out false positives. One thing that came up while I was reading was the formatting of the arguments for Mutect2. One website suggested that the names of the tumor and normal samples must be specified right after their respective input BAMs or else the results will be affected, but I didn't see this stated explicitly on the GATK website. I ran both of the scripts below and got different total numbers of output variants. Can you please confirm which is correct?
Additionally, I believe the PoN that GATK provides is derived from blood samples. However, I couldn't find which platform was used for sequencing. Was it Illumina?
Many thanks.
gatk --java-options "-Xmx${command_mem}m" Mutect2 \
-R ${ref_fasta} \
-I ${tumor_bam} \
-I ${normal_bam} \
-normal B_111_1111 \
-tumor M_111_1111 \
${"--germline-resource " + gnomad} \
${"-L " + intervals} \
-O "${output_vcf}" \
-bamout bamout.bam \
${true='--f1r2-tar-gz f1r2.tar.gz' false='' run_ob_filter} \
-pairHMM AVX_LOGLESS_CACHING \
--native-pair-hmm-threads 1 \
--smith-waterman AVX_ENABLED \
${m2_extra_args}
or
gatk --java-options "-Xmx${command_mem}m" Mutect2 \
-R ${ref_fasta} \
-I ${tumor_bam} \
-tumor M_111_1111 \
-I ${normal_bam} \
-normal B_111_1111 \
${"--germline-resource " + gnomad} \
${"-L " + intervals} \
-O "${output_vcf}" \
-bamout bamout.bam \
${true='--f1r2-tar-gz f1r2.tar.gz' false='' run_ob_filter} \
-pairHMM AVX_LOGLESS_CACHING \
--native-pair-hmm-threads 1 \
--smith-waterman AVX_ENABLED \
${m2_extra_args}
-
@Fia It is totally normal for FilterMutectCalls to filter 90% or more of Mutect2 variant calls. Mutect2 is designed to be very permissive and naive, leaving almost all the responsibility for filtering to FilterMutectCalls.
Our public PoNs are derived from Illumina sequencing and work pretty well for all Illumina samples, regardless of tissue type etc.
The order of arguments is irrelevant in the GATK, so your two commands really shouldn't differ. In any case, though, the -tumor argument is unnecessary and deprecated. Mutect2 only uses the -normal argument and assumes everything else is a tumor.
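Since only -normal matters, it is worth checking that the name passed to it matches the SM field of the read groups in the normal BAM header. A minimal sketch of that check, assuming samtools is installed; the helper name sample_names and the file name normal.bam are placeholders:

```shell
# Print the unique SM (sample name) values found in @RG header lines
# read from stdin.
sample_names() {
  awk -F'\t' '$1 == "@RG" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SM:/) print substr($i, 4)
  }' | sort -u
}

# Usage (not run here):
#   samtools view -H normal.bam | sample_names
```

The value printed for the normal BAM is what should follow -normal (B_111_1111 in the commands above, if that is what the header contains).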
-
Hello everyone,
I'm currently working with exome sequencing data in a tumor-only mode using GATK's Mutect2 and FilterMutectCalls. I'm facing an issue where over 90% of the variants are being filtered out by FilterMutectCalls, leaving only about 3% of the variants with the PASS flag.
I'm wondering why this is happening and what might be causing such a high filtering rate. Additionally, which filters could be adjusted or relaxed to retain more variants without compromising the integrity of the results? Any suggestions or insights would be greatly appreciated.
Thank you!
-
Hi Michelle
Mutect2 and FilterMutectCalls apply a number of filters. The details of these filters are explained in the document below.
https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
Each of these filters has its own model, and together they cause many of the calls to be filtered out. Mutect2 is a very sensitive caller, so any small change that shows up as a variation in the final assembly will appear in the raw calls. FilterMutectCalls evaluates every filter at every site, so each non-PASS site ends up annotated with the combination of filters it failed.
Since you are using the tumor-only approach, our suggestion would be to use the Panel of Normals and the germline resource we provide as supplementary filters, to make sure that you do not capture artifacts and possible germline events as somatic variation. Of course, having a matched normal is the best approach.
For allele fraction filtering, you need to know the fraction of tumor cells in your sample; this can help you decide whether to remove or include more variants in your data.
The clustered events filter can be adjusted to filter more or fewer variants; however, you need to make sure that your known valid variants or false positives are not adversely affected by the change. Our defaults are usually quite balanced for many of these filters.
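To see which filters account for most of the non-PASS sites, the FILTER column of the filtered VCF can be tallied. A minimal sketch, assuming an uncompressed VCF on stdin; the helper name filter_counts and the file name in the usage line are placeholders:

```shell
# Count how often each individual filter appears in the FILTER column
# (7th column) of a VCF read from stdin. A site carrying several filters,
# e.g. "weak_evidence;strand_bias", is counted once for each of them.
filter_counts() {
  awk -F'\t' '!/^#/ {
    n = split($7, f, ";")
    for (i = 1; i <= n; i++) count[f[i]]++
  }
  END { for (k in count) printf "%d\t%s\n", count[k], k }' |
    sort -k1,1nr -k2,2
}

# Usage (not run here):
#   zcat filtered.vcf.gz | filter_counts
```

The output is one line per filter name, most frequent first, which makes it easier to decide which filter is worth looking at before changing any defaults.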
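A hedged sketch of how these two adjustments might look in practice. The purity/2 rule below is a simplifying assumption (a clonal, heterozygous variant in a diploid, copy-neutral region), and the helper names, the GATK variable, and the file names are placeholders. --min-allelic-fraction and --max-events-in-region are FilterMutectCalls options; the values to use are for you to choose and validate against your known variants.

```shell
# Override with GATK=echo to print the command instead of running it.
GATK="${GATK:-gatk}"

# Expected allele fraction of a clonal heterozygous somatic variant,
# ASSUMING a diploid, copy-neutral region: purity / 2.
expected_clonal_af() {
  awk -v p="$1" 'BEGIN { printf "%.3f\n", p / 2 }'
}

# Re-run filtering with an explicit allele-fraction floor and an explicit
# limit on the number of events tolerated in one assembly region.
refilter() {
  rf_ref=$1; rf_in=$2; rf_out=$3; rf_min_af=$4; rf_max_events=$5
  $GATK FilterMutectCalls \
    -R "$rf_ref" \
    -V "$rf_in" \
    --min-allelic-fraction "$rf_min_af" \
    --max-events-in-region "$rf_max_events" \
    -O "$rf_out"
}
```

For a sample with an estimated purity of 0.6, `expected_clonal_af 0.6` gives the AF around which clonal heterozygous variants would be expected to cluster under that assumption; a floor well below it could then be passed to `refilter ref.fa unfiltered.vcf.gz refiltered.vcf.gz <min_af> <max_events>`.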
I hope this helps.