I am using GATK 18.104.22.168, and am analyzing 4 samples of one non-model species (Illumina data), sequenced on the same flowcell, in the same lane. The files I have were provided demultiplexed, and the pipeline I am using operates separately on each file, i.e. bwa -> indexing/markduplicates -> individualvcf -> jointvcf -> individualbqsr -> individual_re_vcf, and so forth. My question is about read groups in this context.
I had originally designated the RGID to be the same for all 4 samples, following the ID definition from this post: https://gatkforums.broadinstitute.org/gatk/discussion/11015/read-groups, and making sample the unique identifier. For example,
with SM being the unique identifier for each of the four sample files.
However, in another forum post (https://gatkforums.broadinstitute.org/gatk/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr), the author of the 9th discussion post sets read groups the same way I have, and the reply says that is wrong because all IDs "need to be different." I can't work out whether this is still a problem for me, given that my pipeline operates on each file separately. It seems to me that this separation is an effective way to ensure error structures are not commingled, as might happen if samples from different lanes or flow cells were merged into a single file. In my pipeline, GenotypeGVCFs is the only merge point; the per-sample files are merged using CombineGVCFs with multiple -V inputs.
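To make the situation concrete, here is a minimal Python sketch (not part of my pipeline) of the check I have in mind: given the @RG header lines from several per-sample files, it reports any read group ID that is attached to more than one SM value. The header lines, IDs, and sample names below are hypothetical placeholders, not my actual read groups, and the function names are my own.

```python
from collections import defaultdict


def parse_rg_line(line):
    """Parse one SAM '@RG' header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line: %r" % line)
    # Each remaining field looks like 'TAG:value'; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])


def ids_shared_across_samples(rg_lines):
    """Return {read group ID: sorted sample names} for IDs used by more than one SM."""
    samples_by_id = defaultdict(set)
    for line in rg_lines:
        rg = parse_rg_line(line)
        samples_by_id[rg["ID"]].add(rg["SM"])
    return {
        rg_id: sorted(samples)
        for rg_id, samples in samples_by_id.items()
        if len(samples) > 1
    }


# Hypothetical headers: same ID for every sample (the setup described above).
shared_id_headers = [
    "@RG\tID:flowcellX.1\tSM:sample1\tPL:ILLUMINA\tLB:lib1",
    "@RG\tID:flowcellX.1\tSM:sample2\tPL:ILLUMINA\tLB:lib2",
]

# Hypothetical headers: one distinct ID per sample.
unique_id_headers = [
    "@RG\tID:flowcellX.1.sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1",
    "@RG\tID:flowcellX.1.sample2\tSM:sample2\tPL:ILLUMINA\tLB:lib2",
]

print(ids_shared_across_samples(shared_id_headers))
print(ids_shared_across_samples(unique_id_headers))
```

With the first set of headers the ID is reported as shared by both samples; with the second set nothing is reported. This only shows where the IDs collide; it does not say whether the collision matters when the files are never merged, which is what I am asking.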
My questions are:
(1) Why would IDs need to be unique if samples are already unique, in cases where everything was sequenced in the same flow cell, in the same lane, at the same time (one covariance/error structure, I would think)?
(2) If in general nonunique RGID is a problem even in the case I've mentioned, does keeping the files separated in the manner that I have them throughout the process end up amounting to an acceptable solution?
(3) If it is an acceptable solution in theory, but there is a special step that must be done at the merge in CombineGVCFs/GenotypeGVCFs or at a BQSR step to make sure sample is the unique designator, what is that step?