Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.
However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).
At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done conveniently at the same time that we do the duplicate marking, by running Mark Duplicates on all read group BAM files for a sample at the same time. Then we run Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below for more details.
Example
Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.
Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:
- sampleA_lane1.fq
- sampleA_lane2.fq
- sampleB_lane1.fq
- sampleB_lane2.fq
These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (eg sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21 and so on).
1. Run initial steps per-readgroup once
Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
The example data becomes:
- sampleA_rgA1.bam
- sampleA_rgA2.bam
- sampleB_rgB1.bam
- sampleB_rgB2.bam
Note that we used to do a first round of marking duplicates here for QC purposes but tool improvements have rendered this obsolete.
2. Merge read groups and mark duplicates per sample (aggregation + dedup)
Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, bur for the sake of efficiency we combine this with the per-sample duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).
The example data becomes:
- sampleA.merged.dedup.bam
- sampleB.merged.dedup.bam
This process of marking duplicates eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).
3. Remaining per-sample pre-processing
Then all that's left to run is base recalibration (BQSR). We used to have another step here called indel realignment, but assuming you're using a modern caller like HaplotypeCaller or Mutect2, you don't need to do it. You would only do so if you were using a locus-based variant caller like MuTect 1 or UnifiedGenotyper, but why would you want to do that? (You really don't)
The example data becomes:
- sample1.merged.dedup.recal.bam
- sample2.merged.dedup.recal.bam
Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).
3 comments
Thanks for the great explanation.
Regarding the following line:
If I am interested in the per-lane statistics (namely how many duplicates per lanes), how can they be extracted if MarkDuplicates is executed only once when merging the lanes into a single BAM?
Looking at the metrics file (I'm using gatk v4.1.7.0), I see that the results are per library. Does it mean that if I want per-lane statistics, I should modify the read group so each lane will have a distinct library (currently the all have the same library)?
Finally I found this thread.
This is exactly what confused me... I tried this while the MarkDuplicates will discard the RG info and I cannot proceed...
What is the recommended workflow if we study non-model organisms? I don't have a set of known SNPs I can use in BQSR, so I was going to proceed straight to HaploCaller... but then I don't understand what comes next. Would I use CombineGVCFs twice, once to combine lanes within a sample, and then to combine samples? How does this influence downstream analyses, e.g., to calculate depth of coverage per individual at a specific loci, can I access that information easily or are the lanes still treated individually in the final dataset?
Please sign in to leave a comment.