There is no formal definition of what a 'read group' is, however in practice this term refers to a set of reads that are generated from a single run of a sequencing instrument.
In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flow cell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.
Read groups are identified in the SAM/BAM /CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.
To see the read group information for a BAM file, use the following command.
samtools view -H sample.bam | grep '^@RG'
This prints the lines starting with @RG
within the header, e.g. as shown in the example below.
@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI
Meaning of the read group fields required by GATK
ID
= Read group identifier
This tag identifies which read group each read belongs to, so each read group'sID
must be unique. It is referenced both in the read group definition line in the file header (starting with@RG
) and in theRG:Z
tag for each read record. Note that some Picard tools have the ability to modifyID
s when merging SAM files in order to avoid collisions. In Illumina data, read groupID
s are composed using the flowcell name and lane number, making them a globally unique identifier across all sequencing data in the world.
Use for BQSR:ID
is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration (unless you have PU defined), since they are assumed to share the same error model.PU
= Platform Unit
ThePU
holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although thePU
is not required by GATK but takes precedence overID
for base recalibration if it is present. In the example shown earlier, two read group fields,ID
andPU
, appropriately differentiate flow cell lane, marked by.2
, a factor that contributes to batch effects.SM
= Sample
The name of the sample sequenced in this read group. GATK tools treat all read groups with the sameSM
value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it is critical that theSM
field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.PL
= Platform/technology used to produce the read
This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.LB
= DNA preparation library identifier
MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.
If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
Deriving ID
and PU
fields from read names
Here we illustrate how to derive both ID
and PU
fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of the read names that come after flow cell lane, and separated by colons, are tile number, x-coordinate of cluster and y-coordinate of cluster.
H0164ALXX140820:2:1101:10003:23460 H0164ALXX140820:2:1101:15118:25288
Breaking down the common portion of the query names:
H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell _____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run _______________:2 #portion of @RG ID and PU fields indicating flow cell lane
Multi-sample and multiplexed example
Suppose that you have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data comes off the sequencer, you would create 12 BAM files, with the following @RG
fields in the header:
Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
Note that the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
5 comments
hello, when I used gatk MarkDuplicates command to mark duplicates for my bam file(after I used the bwa mem command ),the code as in the folllowing:
$bwa mem -t 10 \
-M -R "@RG\tID:${ID}\\tSM:${ID}\\tLB:Targeted\\tPL:Illumina"
$wkdir1/5.ref/hg_38/hg38.fa \
$wkdir1/3.clean/Yangliying/fastp/${sample}_R1.fastp.fastq.gz \
$wkdir1/3.clean/Yangliying/fastp/${sample}_R2.fastp.fastq.gz \
| samtools sort -@ 10 -o $wkdir1/4.align/bwa/Yangliying/test1/${ID}.bam -\
2>$wkdir1/4.align/bwa/Yangliying/test1/${ID}.sorted.log
$gatk MarkDuplicates \
-I /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam \
-M /home/data/vip8t13/wes_pro1/6.gatk/Yangliying/R18067578LU01.markdup_metrics.txt \
-O /home/data/vip8t13/wes_pro1/6.gatk/Yangliying/R18067578LU01.sort.markdup.bam
$ samtools view -H R18067578LU01.bam | grep '^@RG'
@RG ID:R18067578LU01 SM:R18067578LU01 LB:Targeted PL:Illumina
$ samtools view -H R18067578LU01.bam | grep '^@PG'
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 10 -M -R @RG\tID:R18067578LU01\tSM:R18067578LU01\tLB:Targeted\tPL:Illumina /home/data/vip8t13/wes_pro1/5.ref/hg_38/hg38.fa /home/data/vip8t13/wes_pro1/3.clean/Yangliying/R18067578LU01-Yangliying_R1_val_1.fq.gz /home/data/vip8t13/wes_pro1/3.clean/Yangliying/R18067578LU01-Yangliying_R2_val_2.fq.gz
@PG ID:samtools PN:samtools PP:bwa VN: CL:samtools sort -@ 10 -o /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam -
@PG ID:samtools.1 PN:samtools PP:samtools VN: CL:samtools view -H R18067578LU01.bam
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair. Line:
@PG ID:samtools PN:samtools PP:bwa VN: CL:samtools sort -@ 10 -o /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam -; File /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam; Line number 644
at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:258)
at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:46)
at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.<init>(SAMTextHeaderCodec.java:307)
at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:97)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:406)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:265)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:507)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Hi
How find RGID,RGSM of addorreadgroup for rnaseq data of SRA?
I only know the platform.
Just wanted to confirm that ONT is not a valid value for PL. Thanks.
**Edited for clarity 4/28/2023**
Hello,
I am hoping to get some clarification on the GATK read group classifications. Three questions:
1. We're confused about what sample hierarchy information RGSM should contain. In the example for this page, SM refers to the patient and LB refers to the library, but what should SM be in the following scenarios:
2. I see that RGPU is not required by GATK. Can you provide an example of when it would be appropriate to specify RGPU over RGID?
3. Could you point me to documentation/code on how each read group is used?
Any thoughts on the above would be much appreciated.
Thanks!
The RGPU field is inconsistent within this document where it is specified as consisting of "{FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}". However, the example RGPU "PU:H0164ALXX140820.2" does not match this format. Additionally, neither of these match the SAM format specification version 0dd3e0d which describes the PU as: "Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier."
Can you provide some clarification for the RGPU field?
Please sign in to leave a comment.