artifacts called by Mutect2 on RNASeq data
Hi,
I'm using Mutect2 v4.4.0.0 to call variants on RNASeq data without a matched reference. I'm following GATK best practices on RNAseq short variant discovery (https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-) but using Mutect2 instead of HaplotypeCaller to call somatic variants from tumor samples. Looking at IGV, most of the variants seem to have been called properly (you can see them in the BAM and in the VCF), but there are several variants that are called outside of the aligned regions (705 PASS variants out of 6508 total PASS variants). If you keep only SNPs with depth equal or higher than 10, then only 7 variants are called outside of the BAM. All of them are located very close to an aligned region and I've realized that in most cases the region where they are located is very similar in 3 or 4 bp with the beginning of the next aligned region, so I guess the error comes from here. I attach a picture to show what I mean.On the image, a substitution is detected on the first base of the intron, which is a G in the reference but a C in the sample. The 4 first bases of the next exon are the same as the 4 first bases of the intron, except for the G, which is a C here. Three bases upstream, it detects a large insertion, which maps perfectly on the next exon.
Have you detected these type of errors when performing variant calling with Mutect2 on RNAseq data? Do you have any suggestion on how it could be avoided?
Thank you very much,
Alba
-
Unfortunately we don't have much experience with RNA calling but it looks to me like the reads are not getting split properly. I mean, if Mutect2 is calling a read that spans your two exons against the full reference containing introns, it doesn't know anything about RNA and simply sees a read that goes a few bases past the end of the first exon into the intron and has no choice but to call a SNP and an indel.
I am not an expert but I think that somewhere the pipeline ought to be splitting that read into a chunk along the first exon and a chunk along the second.
-
Hi David,
thanks for your reply. I agree that it looks like the reads are not getting split properly, but I actually perform a step to split the exons after alignment with GATK4 SplitNCigarReads, as recommended in GATK Best Practices. So what I give as input to Mutect2 is the recalibrated CRAM obtained after alignment with STAR, removal of Ns with SplitNCigarReads and base quality score recalibration, and, as you can see in the second track of the IGV image, the exons are split properly.
Looking at GATK documentation on how variant calling works, I guess that the errors occur during Mutect's reassembly of the active regions. I wonder if this could be optimized by modifying parameters such as -kmerSize or -minPruning, but I'm not sure how this is going to impact the variant calling accuracy. Do you have any suggestions on how to modify these parameters?
Thank you very much,
Alba
-
Hi Alba,
We don't have nearly enough experience with RNA but I think the best way to get a few guesses would be to generate a bamout via the -bamout bamout.bam argument, and then post an IGV screenshot of the realigned reads in bamout.bam. This way we will see how parts of reads are getting clipped and moved around.
Please sign in to leave a comment.
3 comments