Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

[Repost] ReadGroup missing when Applying BQSR

Answered
0

44 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Robinn Teoh,

    Please see these two documents regarding BAM files and Read Groups

    It looks like there is an issue in your BAM file: 

    @RG\tID:PG3847_01_a\tLB:PG3847_01_A3535_H1\tPL:Illumina\tPU:H753JDSXY.2 \tPM:Novaseq6000\tSM:KuuKuu_tm_R1.fastq.gz Kuu_tm_R2.fastq.gz

    You will also want to look at your reads and determine which reads are not marked as the read group you isolated with FlagStat. I am guessing your issue is that the read group marking was not done properly and so many reads do not have a read group. Or, there is another related problem.

    Could you edit your above comment and remove the SQ lines so that it takes up less space?

    Thanks,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Robinn Teoh

    Dear Genevieve Brandt (she/her),

    I have edited my previous comment, and so sorry for causing the trouble!

    May I ask what issue that might be? Just to make a comparison I will attached the @RG for the BAM file which I had no trouble performing BQSR:

    @SQ SN:AAEX03026072.1 LN:1115
    @RG ID:DK20086_01 LB:DK20086_01 PL:MGISEQ PU:V300079475.1 PM:DNBSEQ SM:KurumiKurumi_R1.fastq.gz
    @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem cf3.1toplevel -t 64 -R @RG\tID:DK20086_01\tLB:DK20086_01\tPL:MGISEQ\tPU:V300079475.1\tPM:DNBSEQ\tSM:KurumiKurumi_R1.fastq.gz Kurumi_R2.fastq.gz
    @PG ID:GATK ApplyBQSR VN:4.1.9.0 CL:ApplyBQSR --output recal_reads_kurumi.bam --bqsr-recal-file recal_kurumi.table --input Kurumi_sorted_dedup.bam --reference cf3.1toplevel.fa --preserve-qscores-less-than 6 --use-original-qualities false --quantize-quals 0 --round-down-quantized false --emit-original-quals false --global-qscore-prior -1.0 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false PN:GATK ApplyBQSR
    @PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools view -H recal_reads_kurumi.bam
    @PG ID:samtools.1 PN:samtools PP:GATK ApplyBQSR VN:1.10 CL:samtools view -H recal_reads_kurumi.bam

     

    I am trying to troubleshoot but my read group marking was done during bwa-mem alignment, which I have adhered to what samtools have suggested for SAM and BAM file creation. With the same way of entry, I got similar @RG in both my illumina and MGIseq BAM files, the only difference is that the former can't undergo recalibration. To my untrained eyes, the @RG for both of the BAM files seemed to be similar though... 

    Best wishes,

    Robinn 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Robinn,

    1. The issue I am referring to is the RG line that I highlighted in my last comment from your header. There are "\t" characters instead of tabs between the fields, indicating that something went wrong when you tried to add read groups. Here is a document on SAM/BAM format on our site, and the standard specification.
    2. Both of your headers seem to be similar... but I want to focus on the problem in your other file. When you ran FlagStat and isolated the only read group that exists in your file (H753JDSXY.2), none of the reads were actually marked as the read group.

    So, because of both problem #1 and problem #2, it is looking like your reads are not correctly marked with read groups. Please read our documentation on read groups and the other discussions on this forum for more information on read groups. Once they are correctly marked, I think you should have no issues with BQSR. 

    Thanks,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Robinn Teoh

    Dear Genevieve Brandt (she/her)

    For issue no.1:

    I have \t characters in my other bam files which have successful BQSRs, so I really do wonder if that is actually the main error though. When I use samtools with the log:

    samtools view -H Kuu_sorted_dedup.bam | grep '^@RG'       

    @RG ID:PG3847_01_a LB:PG3847_01_A3535_H1 PL:Illumina PU:H753JDSXY.2 PM:Novaseq6000 SM:KuuKuu_tm_R1.fastq.gz 

    the above line came up, so is \t really the problem?

    For issue no.2:

    I am trying really hard to detect which part of the reads aren't marked, and also trying to find out how exactly am I supposed to identify if my reads were marked appropriately through the forum in GATK but there isn't a clear cut answer to this, so I am planning to remark the reads after I have gotten the answer for my question below:

    As for marking read groups, may I know when should I be introducing read groups throughout the pipeline? The data preprocessing tutorial stated that read group is an expected requirement, but in nowhere does the tutorial state that where read groups should be first introduced. Should it be during BWA-MEM alignment, or is it probably better during SAM-BAM conversion etc? Maybe things could have gone wrong while marking the read groups from the reference alignment, I am not sure, so if you can guide me on this part it would be great.

    Thank you so much.

    Robinn 

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Robinn,

    For #1, I am not sure that it is the problem, the reason I brought it up is because it indicates that the read groups may have not been added correctly, since the format is wrong. 

    For #2, read groups should be added when you run BWA alignment. But you can add them after the fact with AddOrReplaceReadGroups.

    0
    Comment actions Permalink
  • Avatar
    Robinn Teoh

    Dear Genevieve, 

    For #1 I was looking on SAM specification but they didn't specify the format, but BWA-MEM homepage showed this:

    complete read group header line. ’\t’ can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’.

    so, I guess \t may not be the main problem?

     

    For #2, so after BAM file creation, I can still add read groups and it shouldn't be a problem right? But the best preference would still be during bwa alignment, correct? In that case is there a sample command which I can refer to so I can compare and know what is different from my command entry in my BWA-MEM, if possible, please?

    Thank you~

    Robinn 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Robinn Teoh,

    For #1, yes that is how you add read groups, but the \t should become a tab space if done correctly, not the actual "\t". That is why it indicates a problem.

    #2, it doesn't matter at which stage you add them, as long as they are there to run GATK. The documentation for Picard's AddOrReplaceReadGroups is here: http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups

    Best,

    Genevieve

     

    0
    Comment actions Permalink
  • Avatar
    Robinn Teoh

    Dear Genevieve Brandt (she/her),

    I have tried putting in the read groups from BWA-MEM and re run the whole pipeline, and here are my answers:

    For #1, I found out what was the issue and it came from the entry where I first invoke the read groups:

    bwa mem cf3.1toplevel -t 16 -R '@RG\tID:PG3847_01_a\tLB:PG3847_01_A3535_H1\tPL:Illumina\tPU:H753JDSXY.2 \tPM:Novaseq6000\tSM:Kuu'Kuu_tm_R1.fastq.gz Kuu_tm_R2.fastq.gz > Kuu.aln.sam

    there was actually a space between tPU:H753JDSXY.2 \tPM:Novaseq6000 which wasn't obvious when being post here on the forum, hence that may have caused the \t error seen. When I removed the space, and I retrieve the header:

    @RG ID:PG3847_03_a LB:PG3847_03_A3737_H1 PL:Illumina PU:H753JDSXY.2 PM:Novaseq6000 SM:Kaito

    The RG above was retrieved in the right format. 

    However, this didn't solve my problem as when I tried to apply BQSR the same problem where the read group H753JDSXY.2 is still missing in my recalibrated table from Base Recalibrator. Does this mean I failed to mark my read group on my reads again? Then I realized I could actually retrieve the details of the reads on the first line of my fastq file, which is derived on the line below:

    @A01084:55:H753JDSXY:2:1101:12255:1423 1:N:0:AGCCTCAT+AGTAGAGA

    which corresponds to:

    @machine id:run no:runid:laneid:titleno:x:y:filterpass:controlno:index seq

    With respect to:

    @RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL

    I would like to seek for your advice on how should I be labeling my read group with the header on my sequence, should my \tID be H753JDSXY or H753JDSXY.2 when I invoke the read group in BWA-MEM?

    I actually tried to change this (the @RG) using samtools reheader with my BAM files, but it doesn't allow downstream BaseRecalibrator or Mark Duplicate when I changed the @RG header, which I guess I should start the whole process from BWA-MEM again.

    From the read group tutorial it stated that both ID and PU should share the same read group, meaning in this case it should be H753JDSXY.2 since it should include lane number too right?

    It would be great if you can guide me through this, as the tutorial only showed example for a multiplex run but not single runs.

    Thank you.

    Robinn 

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Robinn,

    I am not sure what will be the best choice in your case, but I will add your question to our backlog for improvement of our read group documentation.

    To get BQSR working, could you check again the FlagStat command that we looked at here: https://gatk.broadinstitute.org/hc/en-us/community/posts/360076232572/comments/360014084971

    Please post the output of those commands.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Robinn Teoh

    Dear Genevieve Brandt (she/her)

    Here are the outputs of the commands:

    11:13:59.955 INFO FlagStat - 0 read(s) filtered by: WellformedReadFilter
    874579472 read(s) filtered by: ReadGroupReadFilter
    874579472 total reads filtered
    11:13:59.958 INFO ProgressMeter - unmapped 56.9 0 0.0
    11:13:59.958 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 56.9 minutes.
    11:13:59.958 INFO FlagStat - Shutting down engine
    [February 19, 2021 11:13:59 AM JST] org.broadinstitute.hellbender.tools.FlagStat done. Elapsed time: 56.96 minutes.
    Runtime.totalMemory()=738721792
    Tool returned:
    0 in total
    0 QC failure
    0 duplicates
    0 mapped (�%)
    0 paired in sequencing
    0 read1
    0 read2
    0 properly paired (�%)
    0 with itself and mate mapped
    0 singletons (�%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)

     

    and for -RF NotSecondaryAlignmentReadFilter

    12:29:25.301 INFO FlagStat - ------------------------------------------------------------
    12:29:25.301 INFO FlagStat - The Genome Analysis Toolkit (GATK) v4.1.9.0
    12:29:25.301 INFO FlagStat - For support and documentation go to https://software.broadinstitute.org/gatk/
    12:29:25.301 INFO FlagStat - Executing as naika@DESKTOP-TMKR2LT on Linux v4.19.128-microsoft-standard amd64
    12:29:25.302 INFO FlagStat - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_275-8u275-b01-0ubuntu1~20.04-b01
    12:29:25.302 INFO FlagStat - Start Date/Time: February 19, 2021 12:29:24 PM JST
    12:29:25.302 INFO FlagStat - ------------------------------------------------------------
    12:29:25.302 INFO FlagStat - ------------------------------------------------------------
    12:29:25.317 INFO FlagStat - HTSJDK Version: 2.23.0
    12:29:25.317 INFO FlagStat - Picard Version: 2.23.3
    12:29:25.317 INFO FlagStat - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    12:29:25.317 INFO FlagStat - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:29:25.317 INFO FlagStat - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:29:25.317 INFO FlagStat - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:29:25.317 INFO FlagStat - Deflater: IntelDeflater
    12:29:25.317 INFO FlagStat - Inflater: IntelInflater
    12:29:25.317 INFO FlagStat - GCS max retries/reopens: 20
    12:29:25.317 INFO FlagStat - Requester pays: disabled
    12:29:25.317 INFO FlagStat - Initializing engine
    12:29:25.928 INFO FlagStat - Done initializing engine
    12:29:25.928 INFO ProgressMeter - Starting traversal
    12:29:25.928 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
    13:17:01.181 INFO FlagStat - 0 read(s) filtered by: WellformedReadFilter
    874579472 read(s) filtered by: ReadGroupReadFilter
    0 read(s) filtered by: NotSecondaryAlignmentReadFilter
    874579472 total reads filtered
    13:17:01.182 INFO ProgressMeter - unmapped 47.6 0 0.0
    13:17:01.182 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 47.6 minutes.
    13:17:01.182 INFO FlagStat - Shutting down engine
    [February 19, 2021 1:17:01 PM JST] org.broadinstitute.hellbender.tools.FlagStat done. Elapsed time: 47.61 minutes.
    Runtime.totalMemory()=715653120
    Tool returned:
    0 in total
    0 QC failure
    0 duplicates
    0 mapped (�%)
    0 paired in sequencing
    0 read1
    0 read2
    0 properly paired (�%)
    0 with itself and mate mapped
    0 singletons (�%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)

    I will also begin a new BWA-MEM alignment again just to be safe, with ID and PU sharing the same input for my read group. 

    Robinn 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Robinn,

    It looks like again all of your reads are filtered out because they do not match the read group you filtered for.

    Once the ReadGroupReadFilter is applied, 0 reads remain.

    In our documentation about BAM files, there is a section on viewing the read group on a specific line of your BAM [More about Read Groups]. Could you look at your BAM file and see what read group is there?

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Alex Falconer

    Maybe someone will come in handy ...

    samtools reheader only changes the header.

    You needed to compare the results of the commands:
    samtools view -H Kuu_sorted_dedup.bam | grep '^@RG'
    samtools view Kuu_sorted_dedup.bam | grep -m1 'RG:.:.'

    After, if necessary, use picard AddOrReplaceReadGroups

    java -jar picard.jar AddOrReplaceReadGroups \
    I=Kuu_sorted_dedup.bam \
    O=Kuu_sorted_dedup_newRG.bam \
    RGID=${ID} \ # that is, RGID will be RG:Z:${ID} !!!
    RGLB=${LB} \
    RGPL=${PL} \
    RGPU=${PU} \ # does not work without RGPU !!!
    RGSM=${SM}

    0
    Comment actions Permalink
  • Avatar
    Alex Falconer

    ... and WSL is not recommended for production. Results may differ from real Linux (tested in WSL1).

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you for posting your solution to this post Alex Falconer!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk