Strange differences in detected variants called by mutect2 --dont-use-soft-clipped-bases true vs false
Dear all,
I am new to the field of somatic variant calling and am currently following the guidelines in these tutorials (https://gatk.broadinstitute.org/hc/en-us/articles/360035889791 and https://gatk.broadinstitute.org/hc/en-us/articles/360035531132).
I use GATK 4.3.0.0 and my command can be found below.
I noticed that among the called variants that remain after filtering and selecting those with the PASS flag, many apparently stem only from soft-clipped bases at the ends of reads, like this one:
1 46087371 . G GGCCACAGCTTCCTGGAACAATGACCAGAAACCTGGGCCTGGACTCACACTTCCTCCT . PASS AC=1;AF=0.500;AN=2;AS_FilterStatus=SITE;AS_SB_TABLE=232,85|5,1;DP=182;ECNT=1;GERMQ=93;MBQ=38,37;MFRL=252,240;MMQ=60,60;MPOS=49;NALOD=2.17;NLOD=41.44;POPAF=6.00;TLOD=14.07 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:176,6:0.038:182:102,1:60,0:189,6:132,44,5,1
When I first inspected this variant in IGV, I did not see it at all (see screenshot 1). In all screenshots, the tumor sample BAM is on top and the normal sample BAM is at the bottom.
Upon turning on display of softclipped reads, it turns out that the altered allele seems to be (part of) those softclipped bases (see screenshot 2).
I checked this for several such strange, preferably "larger" variants (i.e. comprising more than a handful of bases) and found quite a few of them in my data. So I thought --dont-use-soft-clipped-bases should be set to "true" in order to get rid of these artifacts or false-positive variants (correct me if I am wrong).
Indeed, upon setting --dont-use-soft-clipped-bases to "true", the above-mentioned variant disappears from my VCF, and so do others of a similar kind. BUT instead, some new variants come up which were previously either not called at all or flagged with something other than PASS (e.g. "weak_evidence", "clustered_events"). So after excluding soft-clipped bases, my final set of selected variants actually contains more variants than before. Does this make sense?
Example of a variant that is called only after turning on --dont-use-soft-clipped-bases (screenshot 3):
1 26598763 . G C . PASS AC=1;AF=0.500;AN=2;AS_FilterStatus=SITE;AS_SB_TABLE=9,111|0,3;DP=60;ECNT=1;GERMQ=93;MBQ=36,38;MFRL=249,260;MMQ=60,60;MPOS=28;NALOD=1.81;NLOD=18.87;POPAF=6.00;TLOD=6.60 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:57,3:0.064:60:38,3:19,0:57,3:2,55,0,3
An example of a variant that is now filtered differently (see screenshot 4). With --dont-use-soft-clipped-bases = false:
1 182852688 . A C . clustered_events;haplotype;strand_bias;weak_evidence AS_FilterStatus=weak_evidence,strand_bias;AS_SB_TABLE=43,46|0,2;DP=91;ECNT=3;GERMQ=93;MBQ=33,37;MFRL=248,278;MMQ=60,60;MPOS=9;NALOD=1.69;NLOD=12.96;POPAF=6.00;TLOD=4.25 GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|0:44,0:0.021:44:17,0:23,0:44,0:0|1:182852683_G_T:182852683:23,21,0,0 0|1:45,2:0.060:47:16,1:20,1:46,2:0|1:182852683_G_T:182852683:20,25,0,2
And now with --dont-use-soft-clipped-bases = true
1 182852688 . A C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=41,45|1,3;DP=93;ECNT=2;GERMQ=93;MBQ=33,37;MFRL=252,276;MMQ=60,60;MPOS=13;NALOD=1.65;NLOD=13.24;POPAF=6.00;TLOD=11.15 GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|0:44,0:0.022:44:17,0:24,0:44,0:0|1:182852683_G_T:182852683:23,21,0,0 0|1:42,4:0.106:46:16,2:20,2:41,4:0|1:182852683_G_T:182852683:18,24,1,3
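The three kinds of change shown in the records above (a PASS call that disappears, a call that is entirely new, and a call whose filter changes to PASS) could be tabulated across the whole call set rather than checked one variant at a time. The following is a minimal sketch, assuming two FilterMutectCalls outputs, one per setting of --dont-use-soft-clipped-bases. The function names are hypothetical, only the standard library is used, and multi-allelic ALT fields are compared as plain strings.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_filters(lines):
    """Map (chrom, pos, ref, alt) -> FILTER string for every VCF record."""
    filters = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        filters[(chrom, int(pos), ref, alt)] = filt
    return filters


def compare_runs(with_clips, without_clips):
    """Classify how PASS calls differ between the two runs.

    with_clips:    read_filters() output for --dont-use-soft-clipped-bases false
    without_clips: read_filters() output for --dont-use-soft-clipped-bases true
    """
    pass_with = {k for k, f in with_clips.items() if f == "PASS"}
    pass_without = {k for k, f in without_clips.items() if f == "PASS"}
    gained = pass_without - pass_with
    return {
        # PASS with soft clips, not PASS (or absent) without them
        "lost": sorted(pass_with - pass_without),
        # PASS without soft clips, not emitted at all with them
        "new_call": sorted(k for k in gained if k not in with_clips),
        # PASS without soft clips, emitted but filtered with them
        "rescued": sorted((k, with_clips[k]) for k in gained if k in with_clips),
    }
```

To use it, each file would be opened with open_vcf(), passed through read_filters(), and the two resulting dictionaries handed to compare_runs(). The inputs need to be the VCFs written by FilterMutectCalls, not the output of SelectVariants with --exclude-filtered, because the latter no longer contains the non-PASS records needed to tell a "rescued" call from a "new_call".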
Can someone explain to me why this happens, and whether I am doing something wrong (and if so, what)? What would be the best way to get rid of variants stemming from soft-clipped bases, apart from hard clipping?
Here is my complete command:
/software/GATK-4.3.0.0/gatk Mutect2 \
-R human_g1k_v37.fasta \
-I tumor.bam \
-I normal.bam \
-normal normal \
--panel-of-normals somatic-b37_Mutect2-exome-panel.vcf \
--germline-resource af-only-gnomad.raw.sites.grch37.vcf.gz \
--dont-use-soft-clipped-bases true \
-O somatic_mutations.vcf.gz
/software/GATK-4.3.0.0/gatk GetPileupSummaries \
-I tumor.bam \
-V somatic-b37_small_exac_common_3.vcf \
-L somatic-b37_small_exac_common_3.vcf \
-O somatic_mutations.tumor.getpileupsummaries.table
/software/GATK-4.3.0.0/gatk GetPileupSummaries \
-I normal.bam \
-V somatic-b37_small_exac_common_3.vcf \
-L somatic-b37_small_exac_common_3.vcf \
-O somatic_mutations.control.getpileupsummaries.table
/software/GATK-4.3.0.0/gatk CalculateContamination \
-I somatic_mutations.tumor.getpileupsummaries.table \
-matched somatic_mutations.control.getpileupsummaries.table \
-O somatic_mutations.contamination.table
/software/GATK-4.3.0.0/gatk FilterMutectCalls \
-R human_g1k_v37.fasta \
-V somatic_mutations.vcf.gz \
--contamination-table somatic_mutations.contamination.table \
-O somatic_mutations.oncefiltered.vcf
/software/GATK-4.3.0.0/gatk SelectVariants \
-R human_g1k_v37.fasta \
-V somatic_mutations.oncefiltered.vcf \
--sample-name tumor \
--exclude-filtered \
-O somatic_mutations.selectedvars.vcf
Any help is very much appreciated! Thanks in advance.
-
Jana Marie Schwarz, turning off soft-clipped bases is usually not the right thing to do. It takes information away from Mutect2, which, as you have seen, can change calls in both directions.
I acknowledge that you want to disable soft-clips because they appear to be causing a false positive. Perhaps they are -- Mutect2 is far from perfect. However, very often soft clips are the only evidence for real indels. To see what Mutect2's local assembly engine was "thinking", turn on the --bamout bamout.bam option and then view bamout.bam in IGV.
The only circumstance in which I would always recommend discarding soft clips is when somewhere upstream in your pipeline things that should have been marked as hard clips were instead soft-clipped. A soft-clipped base means something of biological origin for which the aligner simply couldn't assign a location. Given infinite computing resources, aligners would search for large indels and there would be no such thing as soft clips.
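In a SAM/BAM record, soft-clipped bases appear as S operations at the ends of the CIGAR string, so the amount of soft clipping on reads around a suspicious call can be measured directly. The following is a minimal sketch that works on CIGAR strings alone. The function names are hypothetical, and this is not a description of how Mutect2 itself treats these bases.

```python
import re

# One CIGAR operation: a length followed by an operator from the SAM specification.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that consume bases of the read sequence (SEQ).
_CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '20S80M' into [(20, 'S'), (80, 'M')]."""
    if cigar == "*":
        return []  # CIGAR unavailable, e.g. unmapped read
    return [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clipped base counts for one read."""
    # Hard clips (H) may sit outside the soft clips at either end; ignore them.
    core = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


def soft_clipped_fraction(cigar):
    """Fraction of the read's stored bases that are soft-clipped."""
    ops = parse_cigar(cigar)
    read_length = sum(n for n, op in ops if op in _CONSUMES_QUERY)
    if read_length == 0:
        return 0.0
    return sum(soft_clip_lengths(cigar)) / read_length
```

For example, soft_clip_lengths("5H10S70M15S") gives (10, 15): the hard-clipped bases are no longer present in the read, while the 25 soft-clipped bases are still stored and available to the assembler.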