Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Best practice with read groups


1 comment

  • Laura Gauthier

    Hi Sheryl,

    MarkDuplicates specifically may not use the read group information, but BQSR definitely does. We recommend creating a separate FASTQ file for each read group and then converting those to an unaligned BAM, which will retain the read group information. You start from paired or interleaved FASTQs: paired FASTQs have a separate file for each read in the pair, while interleaved FASTQs have both reads of a pair one after another in the same file. If you already have interleaved FASTQs, you'll have to split them. You'll also need to split the FASTQs by flowcell-lane, because the workflow for this purpose expects one read group per FASTQ. That should be doable with a relatively easy (but probably long-running) Python script. The read names have the flowcell and lane in them.
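
    As a rough idea of what such a script could look like, here is a minimal sketch. It assumes Illumina read names in the Casava 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y`), uncompressed or gzipped input, and four-line FASTQ records. The function names and the output file naming are made up for illustration and are not part of any GATK workflow.

```python
import gzip
from contextlib import ExitStack
from pathlib import Path


def flowcell_lane(header: str) -> str:
    """Return 'FLOWCELL.LANE' parsed from an Illumina read header.

    Assumed layout (Casava 1.8+): @instrument:run:flowcell:lane:tile:x:y [comment]
    """
    fields = header.strip().lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name: {header!r}")
    return f"{fields[2]}.{fields[3]}"


def split_fastq(fastq_path: str, out_dir: str, prefix: str) -> dict:
    """Write one FASTQ per flowcell-lane and return {key: record count}."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    handles = {}
    counts = {}
    with ExitStack() as stack, opener(fastq_path, "rt") as reads:
        while True:
            # A FASTQ record is four lines: header, bases, '+', qualities.
            record = [reads.readline() for _ in range(4)]
            if not record[0].strip():
                break
            key = flowcell_lane(record[0])
            if key not in handles:
                # Hypothetical naming scheme: <prefix>.<flowcell>.<lane>.fastq
                target = out / f"{prefix}.{key}.fastq"
                handles[key] = stack.enter_context(open(target, "w"))
                counts[key] = 0
            handles[key].writelines(record)
            counts[key] += 1
    return counts
```

    For paired FASTQs you would run `split_fastq` once per mate file (for example with prefixes like `sample_R1` and `sample_R2`); because both mates share a read name, they end up under the same flowcell-lane key. Every record is read and written one at a time, so expect this to be slow on large files.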

    You'll need a TSV with the following:

    This example is for paired FASTQ files, but 
    Platform name is the technology used to produce the reads (e.g. illumina).
    Platform unit should be unique to each read group (e.g. flowcell.lane.barcode).
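
    To make the flowcell.lane.barcode convention concrete, here is a small sketch that builds a platform unit string from a read header and flags values that are not unique. It assumes Casava 1.8+ Illumina headers, where the comment after the space looks like `read:filtered:control:index`; if your headers carry no index, pass the barcode yourself. Both helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional


def platform_unit(header: str, barcode: Optional[str] = None) -> str:
    """Build 'flowcell.lane.barcode' from an Illumina read header.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    name, _, comment = header.strip().lstrip("@").partition(" ")
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name: {header!r}")
    flowcell, lane = fields[2], fields[3]
    if barcode is None:
        parts = comment.split(":")
        if len(parts) < 4 or not parts[3]:
            raise ValueError("no barcode in header; pass one explicitly")
        barcode = parts[3]
    return f"{flowcell}.{lane}.{barcode}"


def duplicated_units(units: Iterable[str]) -> List[str]:
    """Return platform units that occur more than once, sorted."""
    return sorted(unit for unit, n in Counter(units).items() if n > 1)
```

    Running `duplicated_units` over the platform unit values you plan to put in the TSV is a cheap check that each read group really has its own value before you build the unmapped BAMs.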

    Once you have that TSV, you can run the commands to create an unmapped BAM that will have all the information BQSR needs.
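
    The exact commands are not reproduced in this reply. One tool that converts paired FASTQs to an unmapped BAM while setting read group fields is Picard's FastqToSam, which ships with GATK4; treating it as the intended command is an assumption on my part. The sketch below only assembles the argument list for one read group, so you can inspect it before running anything. The helper name and all file names and values in the usage note are placeholders.

```python
import shlex
from typing import List


def fastq_to_sam_args(
    fastq1: str,
    fastq2: str,
    output_bam: str,
    read_group: str,
    sample: str,
    library: str,
    platform_unit: str,
    platform: str = "illumina",
) -> List[str]:
    """Assemble a 'gatk FastqToSam' call for one read group (paired FASTQs).

    Assumes the GATK4 wrapper script 'gatk' is on PATH.
    """
    return [
        "gatk", "FastqToSam",
        "--FASTQ", fastq1,
        "--FASTQ2", fastq2,
        "--OUTPUT", output_bam,
        "--READ_GROUP_NAME", read_group,
        "--SAMPLE_NAME", sample,
        "--LIBRARY_NAME", library,
        "--PLATFORM_UNIT", platform_unit,
        "--PLATFORM", platform,
    ]


def as_shell(args: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(args)
```

    For example, `as_shell(fastq_to_sam_args("s1_R1.fastq", "s1_R2.fastq", "s1.unmapped.bam", "rg1", "s1", "lib1", "FC.1.ACGT"))` gives a single command line you can review, and the same list can be passed to `subprocess.run(args, check=True)` once you are happy with it.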


