Interval list and the “Mate not found” error in WES data
Hello, I’m confused about whether an interval list should be used in the data pre-processing workflow for whole exome sequencing data.
According to the GATK documentation “When should I restrict my analysis to specific intervals?”, exome analysis should include a list of targets with padding, so as to exclude off-target noise. The documentation also lists BQSR as one of the steps that should be run with an interval list, because “the off-target sequencing data is uninformative and is a source of noise, therefore it should be eliminated.”
However, when I provided an interval list (from the manufacturer, with 100bp padding) for both BaseRecalibrator and ApplyBQSR, the resulting BAM failed to pass ValidateSamFile and threw multiple errors like this:
ERROR::MATE_NOT_FOUND:Read name (Some read name), Mate not found for paired read
While searching for a solution, I saw this post in which someone ran into a similar problem. I’m still confused after reading the answers there. What I took from them is:
- Providing an interval list in BQSR is just for scatter-gather parallelism, not for subsetting to specific genomic regions.
- BQSR should be run on all reads that contribute to the model.
If I didn’t get it wrong, aren’t these points a bit contradictory to the GATK documentation regarding interval restriction?
Additional questions regarding this issue:
- If the off-target reads are merely noise, should I still include them in the BQSR model?
- If a read was included in the intervals while its mate was not, can I eliminate both reads, just like what “--SANITIZE true” will do in RevertSam?
- Is there any tool other than RevertSam that can eliminate reads with missing mates? I’ve tried PrintReads with “ProperlyPairedReadFilter”, followed by FixMateInformation, but the outputs from these two steps still report the missing mate error.
If I run BaseRecalibrator and ApplyBQSR without intervals, no error occurs, but the processing time becomes extremely long.
Thank you for your time, any help will be much appreciated!
-
Hi LY Wang,
You are right that the documentation here is a bit unclear, and seemingly contradictory. The solution is found in one of the comments on the other issue you linked to, here.
With BQSR, intervals can be used while running BaseRecalibrator but not ApplyBQSR.
In general though, your understanding is correct that we don't tend to subset to exome regions when running BQSR, but instead use the subsetting for scatter-gather parallelism (which then requires using GatherBQSRReports to combine the resulting reports into one).
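To make the pattern above concrete, here is a rough sketch in Python that only builds the command lines as argument lists and does not run anything. The file names, the shard naming, and the helper function names are all hypothetical placeholders; the GATK options shown (`-R`, `-I`, `-L`, `-O`, `--known-sites`, `--bqsr-recal-file`) are the usual GATK4 ones, but please check them against the tool docs for your version.

```python
# Sketch only: builds GATK command lines as lists of arguments; nothing is executed.
# All file names and helper names here are hypothetical placeholders.

def base_recalibrator_cmd(bam, ref, known_sites, intervals, out_table):
    """BaseRecalibrator restricted to one interval shard via -L."""
    return ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
            "--known-sites", known_sites, "-L", intervals, "-O", out_table]


def gather_reports_cmd(tables, merged_table):
    """GatherBQSRReports: one -I per per-shard recalibration table."""
    cmd = ["gatk", "GatherBQSRReports"]
    for table in tables:
        cmd += ["-I", table]
    return cmd + ["-O", merged_table]


def apply_bqsr_cmd(bam, ref, recal_table, out_bam):
    """ApplyBQSR on the whole BAM: deliberately no -L, so no mates are dropped."""
    return ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
            "--bqsr-recal-file", recal_table, "-O", out_bam]


def build_pipeline(bam, ref, known_sites, shards, out_bam):
    """Return the commands in order: scatter, gather, then apply without intervals."""
    tables = [f"recal.{i}.table" for i in range(len(shards))]
    cmds = [base_recalibrator_cmd(bam, ref, known_sites, shard, table)
            for shard, table in zip(shards, tables)]
    cmds.append(gather_reports_cmd(tables, "recal.merged.table"))
    cmds.append(apply_bqsr_cmd(bam, ref, "recal.merged.table", out_bam))
    return cmds


if __name__ == "__main__":
    for cmd in build_pipeline("sample.bam", "ref.fasta", "known_sites.vcf.gz",
                              ["shard_0.interval_list", "shard_1.interval_list"],
                              "sample.recal.bam"):
        print(" ".join(cmd))
```

The only point of the sketch is where `-L` appears: in every BaseRecalibrator call, and not in ApplyBQSR, so the output BAM still contains both reads of every pair.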
I think that the effect of subsetting BaseRecalibrator to the targets is quite minimal, which is why we don't tend to do it. However, if you do subset to the targets, the important point is to subset only for BaseRecalibrator and keep all reads when you run ApplyBQSR.
-
Thank you very much for your suggestion, Chris Kachulis!
I re-ran ApplyBQSR without specifying intervals and it works! No error was found after ValidateSamFile.
I don't have further questions for now, but as a newbie in bioinformatics, I'm still hoping that GATK could update the best practice for WES analysis one day (the old one seems to be archived).
Again, thank you for your help and all the hard work of your team!