How should I deal with tandem repeat?
Hi guys, I need some insight from your professionals.
I'm trying to call somatic mutations from multiple tumor-normal sample pairs WES data by following the Mutect2 best practice (https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-) and the How-to tutorial (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132).
I called mutect2 tumor-only mode on a bunch of normal sample to make PON, then followed by Mutect2 -- LearnReadOrientationModel -- GetPileupSummaries + CalculateContaminationthe, finally FilterMutectCalls. Detailed commands for one of my samples listed down below.
Can you please provide
a) GATK version used : 4.1.7.0
b) Exact GATK commands used
1. Calling Mutect2
```
/share/home/na.su/conda/envs/gatk/bin/java -jar \
/share/home/na.su/software/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar Mutect2 \
--input /share/home/na.su/work/icc/_i/bam/117.RAW.bam \
--input /share/home/na.su/work/icc/_i/bam/118.RAW.bam \
--reference /share/home/robots/anno/b37/human_g1k_v37.fasta \
--output /share/home/na.su/work/icc/_o/mutect2/117_118/unfiltered.vcf.gz \
--normal-sample 118 \
--intervals /share/home/na.su/work/icc/_i/good.bed \
--germline-resource /share/home/na.su/work/icc/_o/af_only_gnomad/af_sorted.vcf.gz \
--panel-of-normals /share/home/na.su/work/icc/_o/pon/_pon_51_samples.vcf.gz \
--f1r2-tar-gz /share/home/na.su/work/icc/_o/mutect2/117_118/f1r2.tar.gz \
--native-pair-hmm-threads 16
```
2. Calling LearnReadOrientationModel
```
/share/home/na.su/conda/envs/gatk/bin/java -jar \
/share/home/na.su/software/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar LearnReadOrientationModel \
--input /share/home/na.su/work/icc/_o/mutect2/117_118/f1r2.tar.gz \
--output /share/home/na.su/work/icc/_o/mutect2/117_118/ReadOrientation.tar.gz \
--tmp-dir /share/home/na.su/work/local_tmp
```
3. Calling GetPileupSummaries followed by CalculateContamination
```
/share/home/na.su/conda/envs/gatk/bin/java -jar \
/share/home/na.su/software/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar GetPileupSummaries \
--input /share/home/na.su/work/icc/_i/bam/117.RAW.bam \
--variant /share/home/na.su/work/icc/_i/small_exac_common_3_b37.vcf.gz \
--intervals /share/home/na.su/work/icc/_i/small_exac_common_3_b37.vcf.gz \
--output /share/home/na.su/work/icc/_o/mutect2/117_118/GetPileupSummaries.table \
--tmp-dir /share/home/na.su/work/local_tmp \
&& \
/share/home/na.su/conda/envs/gatk/bin/java -jar \
/share/home/na.su/software/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar CalculateContamination \
--input /share/home/na.su/work/icc/_o/mutect2/117_118/GetPileupSummaries.table \
--output /share/home/na.su/work/icc/_o/mutect2/117_118/CalculateContamination.table \
--tumor-segmentation /share/home/na.su/work/icc/_o/mutect2/117_118/segmentation.table \
--tmp-dir /share/home/na.su/work/local_tmp
```
4. Calling FilterMutectCalls
```
/share/home/na.su/conda/envs/gatk/bin/java -jar \
/share/home/na.su/software/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar FilterMutectCalls \
--output /share/home/na.su/work/icc/_o/mutect2/117_118/filtered.vcf.gz \
--reference /share/home/robots/anno/b37/human_g1k_v37.fasta \
--variant /share/home/na.su/work/icc/_o/mutect2/117_118/unfiltered.vcf.gz \
--stats /share/home/na.su/work/icc/_o/mutect2/117_118/unfiltered.vcf.gz.stats \
--tumor-segmentation /share/home/na.su/work/icc/_o/mutect2/117_118/segmentation.table \
--contamination-table /share/home/na.su/work/icc/_o/mutect2/117_118/CalculateContamination.table \
--orientation-bias-artifact-priors /share/home/na.su/work/icc/_o/mutect2/117_118/ReadOrientation.tar.gz \
--intervals /share/home/na.su/work/icc/_i/good.bed \
--min-allele-fraction 0.05 \
--min-median-read-position 10 \
--min-reads-per-strand 2 \
--unique-alt-read-count 5
```
c) The entire error log if applicable. No, the Mutect2 did not report any error.
Here is our doubt. My colleagues and I got confused about some short tandem repeats (STRs) who got "PASS" tags after FilterMutectCalls. Here's an example:
This is the vcf line of one STR who passed the filter yet clearly not that solid after IGV check.
```
##fileformat=VCFv4.2
##FILTER=<ID=FAIL,Description="Fail the site if all alleles fail but for different reasons.">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=possible_numt,Description="Allele depth is below expected coverage of NuMT in autosome">
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">
... skip several lines ...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 117 118
... another skip ...
2 16080800 . GC G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=2,46|5,5;DP=67;ECNT=2;GERMQ=52;MBQ=37,20;MFRL=178,248;MMQ=60,60;MPOS=28;NALOD=1.09;NLOD=5.67;POPAF=6.00;ROQ=93;RPA=7,6;RU=C;STR;STRQ=93;TLOD=14.58 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:28,10:0.176:38:13,4:14,6:1,27,5,5 0/0:20,0:0.042:20:8,0:11,0:1,19,0,0
```
You can find that Mutect2 report this site as a STR, and giving it a STRQ score of 93, then passed slippage filter among other filters.
But when we check the mapping status via IGV, it looks like a false positive call to us:
If our guess's right, it looks like a mapping error or something. We know that a tandem repeat like this can be hard for both sequencing and mapping, therefore leaving little chance for var calling tools to save the day.
Beside this site, there are several STRs who passed all filters yet looks wrong to us. And we noticed that all the STRs with one base change got the same STRQ of 93, discard all other differences there may be.
So advices needed: should we filter out all these single base STRs? Or should we filter them by an STRQ cutoff or something?
-
From the IGV screenshot I am not able to tell what the two tracks are. Can you please provide that info?
-
This site doesn't look so bad to me. The PCR slippage model relies on the empirical fact the indel error rate for STRs is usually lower than 1 in 10 reads. For most Illumina sequencing this is, in fact, overly conservative. The fact that the deletion is seen in 10 out of 38 tumor reads and in no normal reads leads me to trust it. The fact that you also see some insertion errors doesn't mean much. If your sequencing for some reason is more prone to PCR slippage at STRs than the assumptions of FilterMutectCalls you may increase the slippage rate parameter at the cost of many false negatives.
I also don't see a good reason to call this a mapping error because the errors are all at the STR site. Mapping artifacts almost always exhibit a bunch of scattered substitution errors.
The STRQ cap of 93 follows the cap on qual scores in the SAM/BAM spec.
I also notice a lot of non-default parameters in FilterMutectCalls, all set to very strict values. These parameters should only be adjusted if one has a very good reason to do so. Likewise, I recommend running Mutect2 and FilterMutectCalls with their default settings for STRs.
Please sign in to leave a comment.
2 comments