I've been trying to analyze an older WGS dataset of FASTQ files. I aligned the FASTQ files to the hg38 reference genome using BWA-MEM, but I didn't add read group information during alignment. I was working on marking duplicates with GATK when I realized that a read group is essential for that step. I then looked at my FASTQ files and counted the unique instrument names (sequence identifiers), e.g. "@SXX191512", and there is more than one in each FASTQ file. I believe the "@SXX191512" identifier is needed to create a read group and add it to the BAM files. Since there are multiple sequence identifiers, can I use any one of them to create the read group? Does this choice have any impact on duplicate marking?
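For reference, here is a minimal Python sketch of how the identifiers could be counted and turned into a read group string. It assumes 4-line FASTQ records and Illumina-style headers of the form `@instrument:run:flowcell:lane:tile:x:y`, which may not match these files. The helper names (`open_fastq`, `count_header_prefixes`, `read_group_line`) and the sample, library, and flowcell values are hypothetical, not part of any tool.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def count_header_prefixes(lines, n_fields=1):
    """Count the first n_fields colon-separated fields of each read header.

    Assumes 4-line records and Illumina-style headers (an assumption):
    n_fields=1 gives the instrument, n_fields=4 gives instrument:run:flowcell:lane.
    """
    counts = Counter()
    for header in islice(lines, 0, None, 4):  # every 4th line is a header
        if not header.strip():
            continue  # tolerate trailing blank lines
        name = header.split()[0].lstrip("@")  # drop the comment after the space
        counts[":".join(name.split(":")[:n_fields])] += 1
    return counts


def read_group_line(rg_id, sample, platform_unit, library, platform="ILLUMINA"):
    """Build an @RG string with tabs written as backslash-t, as bwa mem -R expects."""
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}",
              f"PU:{platform_unit}", f"LB:{library}", f"PL:{platform}"]
    return "\\t".join(fields)


if __name__ == "__main__":
    import sys

    # Usage: python count_ids.py reads.fastq.gz
    with open_fastq(sys.argv[1]) as handle:
        for prefix, n in count_header_prefixes(handle, n_fields=4).most_common():
            print(prefix, n)
```

Running it with `n_fields=4` shows whether the different identifiers correspond to different flowcells or lanes, which is the level at which read groups are usually defined.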