Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Read groups


  • lee Ethan

    Hello, I used the gatk MarkDuplicates command to mark duplicates in my BAM file (produced with bwa mem). The commands were as follows:

    $bwa mem -t 10 \
        -R "@RG\tID:${ID}\tSM:${ID}\tLB:Targeted\tPL:Illumina" \
        $wkdir1/5.ref/hg_38/hg38.fa \
        $wkdir1/3.clean/Yangliying/fastp/${sample}_R1.fastp.fastq.gz \
        $wkdir1/3.clean/Yangliying/fastp/${sample}_R2.fastp.fastq.gz \
        | samtools sort -@ 10 -o $wkdir1/4.align/bwa/Yangliying/test1/${ID}.bam -

    $gatk MarkDuplicates \
        -I /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam \
        -M /home/data/vip8t13/wes_pro1/6.gatk/Yangliying/R18067578LU01.markdup_metrics.txt \
        -O /home/data/vip8t13/wes_pro1/6.gatk/Yangliying/R18067578LU01.sort.markdup.bam

    $ samtools view -H R18067578LU01.bam | grep '^@RG'
    @RG     ID:R18067578LU01        SM:R18067578LU01        LB:Targeted     PL:Illumina
    $ samtools view -H R18067578LU01.bam | grep '^@PG'
    @PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 10 -M -R @RG\tID:R18067578LU01\tSM:R18067578LU01\tLB:Targeted\tPL:Illumina /home/data/vip8t13/wes_pro1/5.ref/hg_38/hg38.fa /home/data/vip8t13/wes_pro1/3.clean/Yangliying/R18067578LU01-Yangliying_R1_val_1.fq.gz /home/data/vip8t13/wes_pro1/3.clean/Yangliying/R18067578LU01-Yangliying_R2_val_2.fq.gz
    @PG     ID:samtools     PN:samtools     PP:bwa  VN:     CL:samtools sort -@ 10 -o /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam -
    @PG     ID:samtools.1   PN:samtools     PP:samtools     VN:     CL:samtools view -H R18067578LU01.bam

    To get help, see
    htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair. Line:
    @PG     ID:samtools     PN:samtools     PP:bwa  VN:     CL:samtools sort -@ 10 -o /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam -; File /home/data/vip8t13/wes_pro1/4.align/bwa/Yangliying/R18067578LU01.bam; Line number 644
            at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(
            at htsjdk.samtools.SAMTextHeaderCodec.access$200(
            at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.<init>(
            at htsjdk.samtools.SAMTextHeaderCodec.decode(
            at htsjdk.samtools.BAMFileReader.readHeader(
            at htsjdk.samtools.BAMFileReader.<init>(
            at htsjdk.samtools.BAMFileReader.<init>(
            at htsjdk.samtools.SamReaderFactory$
            at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(
            at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(
            at picard.sam.markduplicates.MarkDuplicates.doWork(
            at picard.cmdline.CommandLineProgram.instanceMain(
            at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(
            at org.broadinstitute.hellbender.Main.mainEntry(
            at org.broadinstitute.hellbender.Main.main(
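
The exception above names the `@PG` line written by `samtools sort`, and in that line the `VN:` tag has no value. That empty value appears to be what the header parser rejects. The following is a minimal sketch for spotting and patching such tags, not a confirmed fix: the function names `find_empty_header_tags` and `fill_empty_vn`, the placeholder value `unknown`, and the file names in the usage comments are all assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Print every SAM header line that contains a tag with an empty value,
# such as "VN:" followed directly by a tab or the end of the line.
# Reads tab-separated header text on stdin and prints "<line number>: <line>".
find_empty_header_tags() {
    awk -F'\t' '
        /^@CO/ { next }                  # free-text comment lines are skipped
        /^@/ {
            for (i = 2; i <= NF; i++) {
                # A well-formed field is TAG:VALUE with a non-empty VALUE.
                if ($i ~ /^[A-Za-z][A-Za-z0-9]:$/) {
                    print NR ": " $0
                    break
                }
            }
        }'
}

# Replace an empty VN: tag on @PG lines with a placeholder value.
# "unknown" is an arbitrary placeholder chosen for this sketch.
fill_empty_vn() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^@PG/ { for (i = 2; i <= NF; i++) if ($i == "VN:") $i = "VN:unknown" }
         { print }'
}

# Hypothetical usage, assuming samtools is on PATH:
#   samtools view -H in.bam | find_empty_header_tags
#   samtools view -H in.bam | fill_empty_vn > fixed_header.sam
#   samtools reheader fixed_header.sam in.bam > in.reheadered.bam
```

Both functions work on header text only, so they can be tried on a saved header before any BAM file is rewritten.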

  • maryam montazeri

    How do I find the RGID and RGSM values to pass to AddOrReplaceReadGroups for RNA-seq data from SRA?
    I only know the platform.

  • Shin Lin

    Just wanted to confirm that ONT is not a valid value for PL.  Thanks.

  • Layne Rogers

    **Edited for clarity 4/28/2023**


    I am hoping to get some clarification on the GATK read group classifications. Three questions:

    1. We're confused about what sample hierarchy information RGSM should contain. In the example for this page, SM refers to the patient and LB refers to the library, but what should SM be in the following scenarios:

    • Tumor and germline DNA were collected from two patients each -- is SM the patient or is SM the patient + tissue source i.e. tumor or normal?
    • If we just have tumor DNA collected from two patients and each library was sequenced across two flowcells, would LB be the library name + flowcell ID?

    2. I see that RGPU is not required by GATK. Can you provide an example of when it would be appropriate to specify RGPU over RGID?

    3. Could you point me to documentation/code on how each read group is used?

    Any thoughts on the above would be much appreciated.



  • Trevor Freeman

    The RGPU field is described inconsistently in this document. It is specified as consisting of "{FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}", but the example RGPU "PU:H0164ALXX140820.2" does not match this format. Additionally, neither of these matches the SAM format specification (version 0dd3e0d), which describes PU as: "Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier."

    Can you provide some clarification for the RGPU field?
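
To make the two layouts quoted above concrete, here is a minimal sketch that builds a platform-unit string from the first read name of an Illumina FASTQ stream. It assumes Casava 1.8+ read names of the form `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>`. The function name `pu_from_fastq`, its `yes` argument, and the file name in the usage comment are hypothetical, and the sketch does not settle which layout GATK expects.

```shell
#!/usr/bin/env bash

# Print "{FLOWCELL_BARCODE}.{LANE}" for the first read name on stdin.
# With the argument "yes", append ".{SAMPLE_BARCODE}" taken from the index
# field at the end of the read name, when that field is present.
# Exits non-zero if the read name is not in the assumed colon-separated layout.
pu_from_fastq() {
    awk -F: -v with_barcode="${1:-no}" 'NR == 1 {
        if (NF < 7) exit 1               # e.g. SRA-style "@SRR..." names
        pu = $3 "." $4                   # flowcell ID and lane
        if (with_barcode == "yes" && NF >= 10) pu = pu "." $NF
        print pu
        exit 0
    }'
}

# Hypothetical usage:
#   gzip -cd sample_R1.fastq.gz | head -n 1 | pu_from_fastq
#   gzip -cd sample_R1.fastq.gz | head -n 1 | pu_from_fastq yes
```

Read names that have been rewritten, as is common for files downloaded from SRA, do not carry the flowcell or lane, so the function returns a failure status for them instead of guessing.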
