Hello everyone! Please excuse me if this question is naïve: I'm still new to bioinformatics and GATK.
I am using the GATK4 suite to ultimately call germline variants on whole exome sequencing data, obtained from an Illumina NextSeq 550 sequencer. For a variety of reasons I cannot use the WDL/Cromwell setup recommended by the Best Practices, so I am trying to replicate the recommended workflow in Bash.
I would like to speed up the BQSR step by employing the Scatter / Gather strategy. However, studying this article (https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR-), I've realized that BaseRecalibrator requires a lot of data to build a proper statistical model.
My question: is it okay to scatter the BaseRecalibrator job by chromosome if I analyze just one WES sample at a time? (I know that downstream I will need to perform joint genotyping with 30+ samples, but at the moment I'm preparing single-sample BAM files one by one.)
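To make the question concrete, here is a minimal Bash sketch of the per-chromosome layout being asked about. It assumes the per-interval tables would be merged with GatherBQSRReports before the recalibration is applied. All file names, the `bqsr_scatter_gather` and `run` helpers, the `DRY_RUN` switch, and the `chr1`-style contig names are hypothetical. By default the script only prints the commands, so it can be inspected without GATK installed.

```shell
#!/usr/bin/env bash
# Sketch only: scatter BaseRecalibrator by contig, then gather the tables.
# Helper names, paths and DRY_RUN are illustrative assumptions.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

bqsr_scatter_gather() {
    local bam=$1 ref=$2 known_sites=$3 outdir=$4
    shift 4
    local contig pid
    local pids=()
    local gather_inputs=()

    # Scatter: one BaseRecalibrator job per contig, run in the background.
    for contig in "$@"; do
        run gatk BaseRecalibrator \
            -I "$bam" -R "$ref" --known-sites "$known_sites" \
            -L "$contig" -O "$outdir/recal.$contig.table" &
        pids+=("$!")
        gather_inputs+=(-I "$outdir/recal.$contig.table")
    done

    # Stop if any scattered job failed.
    for pid in "${pids[@]}"; do
        wait "$pid" || return 1
    done

    # Gather: merge the per-contig tables into a single report.
    run gatk GatherBQSRReports "${gather_inputs[@]}" \
        -O "$outdir/recal.merged.table"
}

# Example (dry run):
#   bqsr_scatter_gather sample.bam ref.fasta dbsnp.vcf.gz out chr1 chr2 chr3
```

In this layout the merged table, not the individual per-contig tables, would be the one passed on to ApplyBQSR. Whether that is statistically sound for a single exome is exactly what I am unsure about.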
The article above says that BaseRecalibrator expects each read group to have at least 100M bases. As a naive estimate, dividing PF_HQ_ALIGNED_BASES (from the CollectAlignmentSummaryMetrics output) evenly across 23 chromosomes gives 215+ megabases per chromosome.
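The naive estimate above can be reproduced with a short Bash/awk sketch. It assumes the usual Picard-style metrics layout: comment lines starting with `#`, then a tab-separated header row containing `CATEGORY` and `PF_HQ_ALIGNED_BASES`, then one row per category. The function name and its arguments are hypothetical, and the default `PAIR` row is an assumption that fits paired-end data.

```shell
#!/usr/bin/env bash
# Sketch only: average PF_HQ_ALIGNED_BASES per scatter chunk.
# Usage: mean_hq_bases_per_chunk <metrics_file> <n_chunks> [category]

mean_hq_bases_per_chunk() {
    local metrics_file=$1      # CollectAlignmentSummaryMetrics output
    local n_chunks=$2          # e.g. 23 for a per-chromosome scatter
    local category=${3:-PAIR}  # row to read; PAIR assumes paired-end reads

    awk -F'\t' -v n="$n_chunks" -v want="$category" '
        # Skip comment lines and blank lines.
        /^#/ || NF == 0 { next }
        # The first remaining line is the header: locate the columns by name.
        !col {
            for (i = 1; i <= NF; i++) {
                if ($i == "CATEGORY") catcol = i
                if ($i == "PF_HQ_ALIGNED_BASES") col = i
            }
            next
        }
        # Data row for the requested category: print the truncated average.
        $catcol == want { printf "%.0f\n", int($col / n); found = 1 }
        END { if (!found) exit 1 }
    ' "$metrics_file"
}

# Example:
#   mean_hq_bases_per_chunk sample.alignment_summary_metrics.txt 23
```

This is only an even split. In exome data the reads are concentrated on the targets, so the real number of bases per chromosome will vary, and small chromosomes may fall well below the average.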