Should I perform BQSR if data is from a non-model organism and has ~10x more variation than humans?
Hi there. I'm currently working through GATK's data pre-processing pipeline (mapping, marking duplicates, etc.) to ultimately detect germline SNPs using GATK's pipeline. However, we are working with DNA-seq data from a non-model, diploid organism (a type of clam) which has ~10x greater genetic variation than humans and a good amount of repeat content.
In reviewing recent literature, best practices seem to be split as some researchers working with a similar species choose to perform BQSR while others do not. Additionally, some colleagues have put forth that they removed it from their pipeline about two years ago, so I'm wondering if this publishing delay is causing the literature to be convoluted.
For context, if we do have to perform BQSR, we will have to bootstrap because we have no high-confidence known sites for recalibration (which is fine, but before I begin this process I'd like to know whether or not this is a scenario which calls for BQSR).
Thank you for your help.
-
Whether BQSR is useful is still actively debated in the field. One thing worth keeping in mind is that not all sequencing instruments are equal, so it is up to the user to decide whether BQSR is worthwhile given the variety of samples and technologies, the informatics workflow, any built-in recalibration tools from the manufacturer, and so on.
Bootstrapping might be your only option. However, if your samples come from a diverse set of machines, run dates and technologies, it may be better to recalibrate your base qualities so that they are comparable across samples. If all your samples come from a single machine, and from a single run or from multiple runs with similar basecall metrics, then it may not be necessary. Some vendors' proprietary informatics pipelines already integrate a form of BQSR, just not as a separate step in the workflow.
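If you do go the bootstrapping route, the rough shape of it is: call variants on the unrecalibrated BAM, hard-filter to keep only the calls you trust most, use those as the known sites for BaseRecalibrator, apply the recalibration, then call again and repeat until the recalibration tables stop changing. Below is a minimal sketch that only builds the GATK4 command lines for that loop so you can review them before running anything. The file names, the number of rounds, and the hard-filter thresholds (GATK's generic SNP starting points) are assumptions for illustration and would need tuning for a highly polymorphic genome. Recalibrating the original BAM in every round, rather than stacking recalibrations, is also a choice made here and not the only way to do it.

```python
import shlex


def bootstrap_bqsr_commands(ref, bam, rounds=2, prefix="boot"):
    """Build GATK4 command lines (as argument lists) for bootstrapped BQSR.

    Nothing is executed here; the caller decides how to run the commands.
    All output file names are hypothetical and derived from `prefix`.
    """
    commands = []
    current_bam = bam  # BAM used for variant calling in this round
    for i in range(1, rounds + 1):
        raw = f"{prefix}.round{i}.raw.vcf.gz"
        flagged = f"{prefix}.round{i}.flagged.vcf.gz"
        known = f"{prefix}.round{i}.known.vcf.gz"
        table = f"{prefix}.round{i}.recal.table"
        out_bam = f"{prefix}.round{i}.recal.bam"
        commands += [
            # 1. Call variants on the current BAM.
            ["gatk", "HaplotypeCaller", "-R", ref, "-I", current_bam, "-O", raw],
            # 2. Flag low-confidence calls (placeholder thresholds, tune these).
            ["gatk", "VariantFiltration", "-R", ref, "-V", raw, "-O", flagged,
             "--filter-name", "QD2", "--filter-expression", "QD < 2.0",
             "--filter-name", "FS60", "--filter-expression", "FS > 60.0",
             "--filter-name", "MQ40", "--filter-expression", "MQ < 40.0"],
            # 3. Keep only the calls that passed every filter.
            ["gatk", "SelectVariants", "-V", flagged, "--exclude-filtered",
             "-O", known],
            # 4. Model base quality errors on the ORIGINAL BAM, masking the
            #    bootstrapped sites so real variants are not counted as errors.
            ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
             "--known-sites", known, "-O", table],
            # 5. Write a recalibrated BAM for the next round of calling.
            ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
             "--bqsr-recal-file", table, "-O", out_bam],
        ]
        current_bam = out_bam
    return commands


if __name__ == "__main__":
    for cmd in bootstrap_bqsr_commands("ref.fasta", "sample.dedup.bam"):
        print(shlex.join(cmd))
```

To judge convergence you can compare the recalibration tables from consecutive rounds, for example with gatk AnalyzeCovariates (-before, -after, -plots). If the second round barely moves the qualities relative to the first, that is also a hint that BQSR is not buying you much on this data.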
I hope this helps.