[Repost] ReadGroup missing when Applying BQSR
AnsweredDear all,
I am currently using gatk4.1.9.0 to run BQSR on my sorted and deduped bam files, which I have had no problems with my previous 2 files, but having the error as stated below:
Command inserted:
./gatk ApplyBQSR -R cf3.1toplevel.fa -I Kuu_sorted_dedup.bam -bqsr recal_kuu.table -O recal_reads_kuu.bam
Error log:
$ ./gatk ApplyBQSR -R cf3.1toplevel.fa -I Kuu_sorted_dedup.bam -bqsr recal_kuu.table -O recal_reads_kuu.bam
Using GATK jar /mnt/d/Docker/WGS/gatk4/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mnt/d/Docker/WGS/gatk4/gatk-package-4.1.9.0-local.jar ApplyBQSR -R cf3.1toplevel.fa -I Kuu_sorted_dedup.bam -bqsr recal_kuu.table -O recal_reads_kuu.bam
12:30:09.851 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/d/Docker/WGS/gatk4/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 01, 2021 12:30:09 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
12:30:09.993 INFO ApplyBQSR - ------------------------------------------------------------
12:30:09.993 INFO ApplyBQSR - The Genome Analysis Toolkit (GATK) v4.1.9.0
12:30:09.993 INFO ApplyBQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
12:30:09.993 INFO ApplyBQSR - Executing as naika@DESKTOP-TMKR2LT on Linux v4.19.128-microsoft-standard amd64
12:30:09.993 INFO ApplyBQSR - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_275-8u275-b01-0ubuntu1~20.04-b01
12:30:09.993 INFO ApplyBQSR - Start Date/Time: January 1, 2021 12:30:09 PM JST
12:30:09.993 INFO ApplyBQSR - ------------------------------------------------------------
12:30:09.993 INFO ApplyBQSR - ------------------------------------------------------------
12:30:09.994 INFO ApplyBQSR - HTSJDK Version: 2.23.0
12:30:09.994 INFO ApplyBQSR - Picard Version: 2.23.3
12:30:09.994 INFO ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:30:09.994 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:30:09.994 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:30:09.994 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:30:09.994 INFO ApplyBQSR - Deflater: IntelDeflater
12:30:09.994 INFO ApplyBQSR - Inflater: IntelInflater
12:30:09.994 INFO ApplyBQSR - GCS max retries/reopens: 20
12:30:09.994 INFO ApplyBQSR - Requester pays: disabled
12:30:09.994 INFO ApplyBQSR - Initializing engine
12:30:10.354 INFO ApplyBQSR - Done initializing engine
12:30:10.407 INFO ProgressMeter - Starting traversal
12:30:10.407 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
12:30:10.619 INFO ApplyBQSR - Shutting down engine
[January 1, 2021 12:30:10 PM JST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=982515712
java.lang.IllegalStateException: The covariates table is missing ReadGroup H753JDSXY.2 in RecalTable0
at org.broadinstitute.hellbender.utils.Utils.validate(Utils.java:750)
at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.keyForReadGroup(ReadGroupCovariate.java:81)
at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.recordValues(ReadGroupCovariate.java:53)
at org.broadinstitute.hellbender.utils.recalibration.covariates.StandardCovariateList.recordAllValuesInStorage(StandardCovariateList.java:133)
at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:546)
at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:527)
at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:145)
at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:27)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
To be safe, I will attach the log of the recal.table when I first created the table as well as the command:
Command:
./gatk BaseRecalibrator -R cf3.1toplevel.fa -I Kuu_sorted_dedup.bam --known-sites bqsr_kuu_snps.vcf --known-sites bqsr_kuu_indels.vcf -O recal_kuu.table
Completion log:
12:09:57.665 INFO ProgressMeter - AAEX03025150.1:3340 121.9 524369000 4301180.3
12:10:12.594 INFO BaseRecalibrator - 43702519 read(s) filtered by: MappingQualityNotZeroReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
120929799 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: WellformedReadFilter
164632318 total reads filtered
12:10:12.595 INFO ProgressMeter - AAEX03026059.1:1598 122.2 524603691 4294340.3
12:10:12.595 INFO ProgressMeter - Traversal complete. Processed 524603691 total reads in 122.2 minutes.
12:10:12.614 INFO BaseRecalibrator - Calculating quantized quality scores...
12:10:12.621 INFO BaseRecalibrator - Writing recalibration report...
12:10:13.454 INFO BaseRecalibrator - ...done!
12:10:13.454 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 524603691 reads
12:10:13.455 INFO BaseRecalibrator - Shutting down engine
[January 1, 2021 12:10:13 PM JST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 122.19 minutes.
Runtime.totalMemory()=693108736
Tool returned:
SUCCESS
I have had no problem with my previous ApplyBQSR with the same flow, same parameters and entry command, but somehow I am having issues with this current run. At the same time, I realize I face this problem with all 3 WGS data retrieved from an Illumina platform but not on the data from MGI platform. The flow from alignment, samtools fix, samtools sort and picard mark duplicates were all the same, and when I check if my bam files were corrupted using samtools the results came out clean.
Any ideas what might be wrong, seeing I actually have 524603691 reads recalibrated?
Thanks.
-
Hi Robinn Teoh,
Please see these two documents regarding BAM files and Read Groups.
It looks like there is an issue in your BAM file:
@RG\tID:PG3847_01_a\tLB:PG3847_01_A3535_H1\tPL:Illumina\tPU:H753JDSXY.2 \tPM:Novaseq6000\tSM:KuuKuu_tm_R1.fastq.gz Kuu_tm_R2.fastq.gz
You will also want to look at your reads and determine which reads are not marked as the read group you isolated with FlagStat. I am guessing your issue is that the read group marking was not done properly and so many reads do not have a read group. Or, there is another related problem.
Could you edit your above comment and remove the SQ lines so that it takes up less space?
Thanks,
Genevieve
-
Dear Genevieve Brandt (she/her),
I have edited my previous comment, and so sorry for causing the trouble!
May I ask what issue that might be? Just to make a comparison I will attached the @RG for the BAM file which I had no trouble performing BQSR:
@SQ SN:AAEX03026072.1 LN:1115
@RG ID:DK20086_01 LB:DK20086_01 PL:MGISEQ PU:V300079475.1 PM:DNBSEQ SM:KurumiKurumi_R1.fastq.gz
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem cf3.1toplevel -t 64 -R @RG\tID:DK20086_01\tLB:DK20086_01\tPL:MGISEQ\tPU:V300079475.1\tPM:DNBSEQ\tSM:KurumiKurumi_R1.fastq.gz Kurumi_R2.fastq.gz
@PG ID:GATK ApplyBQSR VN:4.1.9.0 CL:ApplyBQSR --output recal_reads_kurumi.bam --bqsr-recal-file recal_kurumi.table --input Kurumi_sorted_dedup.bam --reference cf3.1toplevel.fa --preserve-qscores-less-than 6 --use-original-qualities false --quantize-quals 0 --round-down-quantized false --emit-original-quals false --global-qscore-prior -1.0 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false PN:GATK ApplyBQSR
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools view -H recal_reads_kurumi.bam
@PG ID:samtools.1 PN:samtools PP:GATK ApplyBQSR VN:1.10 CL:samtools view -H recal_reads_kurumi.bamI am trying to troubleshoot but my read group marking was done during bwa-mem alignment, which I have adhered to what samtools have suggested for SAM and BAM file creation. With the same way of entry, I got similar @RG in both my illumina and MGIseq BAM files, the only difference is that the former can't undergo recalibration. To my untrained eyes, the @RG for both of the BAM files seemed to be similar though...
Best wishes,
Robinn
-
Hi Robinn,
- The issue I am referring to is the RG line that I highlighted in my last comment from your header. There are "\t" characters instead of tabs between the fields, indicating that something went wrong when you tried to add read groups. Here is a document on SAM/BAM format on our site, and the standard specification.
- Both of your headers seem to be similar... but I want to focus on the problem in your other file. When you ran FlagStat and isolated the only read group that exists in your file (H753JDSXY.2), none of the reads were actually marked as the read group.
So, because of both problem #1 and problem #2, it is looking like your reads are not correctly marked with read groups. Please read our documentation on read groups and the other discussions on this forum for more information on read groups. Once they are correctly marked, I think you should have no issues with BQSR.
Thanks,
Genevieve
-
Dear Genevieve Brandt (she/her)
For issue no.1:
I have \t characters in my other bam files which have successful BQSRs, so I really do wonder if that is actually the main error though. When I use samtools with the log:
samtools view -H Kuu_sorted_dedup.bam | grep '^@RG'
@RG ID:PG3847_01_a LB:PG3847_01_A3535_H1 PL:Illumina PU:H753JDSXY.2 PM:Novaseq6000 SM:KuuKuu_tm_R1.fastq.gz
the above line came up, so is \t really the problem?
For issue no.2:
I am trying really hard to detect which part of the reads aren't marked, and also trying to find out how exactly am I supposed to identify if my reads were marked appropriately through the forum in GATK but there isn't a clear cut answer to this, so I am planning to remark the reads after I have gotten the answer for my question below:
As for marking read groups, may I know when should I be introducing read groups throughout the pipeline? The data preprocessing tutorial stated that read group is an expected requirement, but in nowhere does the tutorial state that where read groups should be first introduced. Should it be during BWA-MEM alignment, or is it probably better during SAM-BAM conversion etc? Maybe things could have gone wrong while marking the read groups from the reference alignment, I am not sure, so if you can guide me on this part it would be great.
Thank you so much.
Robinn
-
Hi Robinn,
For #1, I am not sure that it is the problem, the reason I brought it up is because it indicates that the read groups may have not been added correctly, since the format is wrong.
For #2, read groups should be added when you run BWA alignment. But you can add them after the fact with AddOrReplaceReadGroups.
-
Dear Genevieve,
For #1 I was looking on SAM specification but they didn't specify the format, but BWA-MEM homepage showed this:
complete read group header line. ’\t’ can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’.
so, I guess \t may not be the main problem?
For #2, so after BAM file creation, I can still add read groups and it shouldn't be a problem right? But the best preference would still be during bwa alignment, correct? In that case is there a sample command which I can refer to so I can compare and know what is different from my command entry in my BWA-MEM, if possible, please?
Thank you~
Robinn
-
Hi Robinn Teoh,
For #1, yes that is how you add read groups, but the \t should become a tab space if done correctly, not the actual "\t". That is why it indicates a problem.
#2, it doesn't matter at which stage you add them, as long as they are there to run GATK. The documentation for Picard's AddOrReplaceReadGroups is here: http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups
Best,
Genevieve
-
Dear Genevieve Brandt (she/her),
I have tried putting in the read groups from BWA-MEM and re run the whole pipeline, and here are my answers:
For #1, I found out what was the issue and it came from the entry where I first invoke the read groups:
bwa mem cf3.1toplevel -t 16 -R '@RG\tID:PG3847_01_a\tLB:PG3847_01_A3535_H1\tPL:Illumina\tPU:H753JDSXY.2 \tPM:Novaseq6000\tSM:Kuu'Kuu_tm_R1.fastq.gz Kuu_tm_R2.fastq.gz > Kuu.aln.sam
there was actually a space between tPU:H753JDSXY.2 \tPM:Novaseq6000 which wasn't obvious when being post here on the forum, hence that may have caused the \t error seen. When I removed the space, and I retrieve the header:
@RG ID:PG3847_03_a LB:PG3847_03_A3737_H1 PL:Illumina PU:H753JDSXY.2 PM:Novaseq6000 SM:Kaito
The RG above was retrieved in the right format.
However, this didn't solve my problem as when I tried to apply BQSR the same problem where the read group H753JDSXY.2 is still missing in my recalibrated table from Base Recalibrator. Does this mean I failed to mark my read group on my reads again? Then I realized I could actually retrieve the details of the reads on the first line of my fastq file, which is derived on the line below:
@A01084:55:H753JDSXY:2:1101:12255:1423 1:N:0:AGCCTCAT+AGTAGAGA
which corresponds to:
@machine id:run no:runid:laneid:titleno:x:y:filterpass:controlno:index seq
With respect to:
@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL
I would like to seek for your advice on how should I be labeling my read group with the header on my sequence, should my \tID be H753JDSXY or H753JDSXY.2 when I invoke the read group in BWA-MEM?
I actually tried to change this (the @RG) using samtools reheader with my BAM files, but it doesn't allow downstream BaseRecalibrator or Mark Duplicate when I changed the @RG header, which I guess I should start the whole process from BWA-MEM again.
From the read group tutorial it stated that both ID and PU should share the same read group, meaning in this case it should be H753JDSXY.2 since it should include lane number too right?
It would be great if you can guide me through this, as the tutorial only showed example for a multiplex run but not single runs.
Thank you.
Robinn
-
Hi Robinn,
I am not sure what will be the best choice in your case, but I will add your question to our backlog for improvement of our read group documentation.
To get BQSR working, could you check again the FlagStat command that we looked at here: https://gatk.broadinstitute.org/hc/en-us/community/posts/360076232572/comments/360014084971
Please post the output of those commands.
Best,
Genevieve
-
Dear Genevieve Brandt (she/her)
Here are the outputs of the commands:
11:13:59.955 INFO FlagStat - 0 read(s) filtered by: WellformedReadFilter
874579472 read(s) filtered by: ReadGroupReadFilter
874579472 total reads filtered
11:13:59.958 INFO ProgressMeter - unmapped 56.9 0 0.0
11:13:59.958 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 56.9 minutes.
11:13:59.958 INFO FlagStat - Shutting down engine
[February 19, 2021 11:13:59 AM JST] org.broadinstitute.hellbender.tools.FlagStat done. Elapsed time: 56.96 minutes.
Runtime.totalMemory()=738721792
Tool returned:
0 in total
0 QC failure
0 duplicates
0 mapped (�%)
0 paired in sequencing
0 read1
0 read2
0 properly paired (�%)
0 with itself and mate mapped
0 singletons (�%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)and for -RF NotSecondaryAlignmentReadFilter
12:29:25.301 INFO FlagStat - ------------------------------------------------------------
12:29:25.301 INFO FlagStat - The Genome Analysis Toolkit (GATK) v4.1.9.0
12:29:25.301 INFO FlagStat - For support and documentation go to https://software.broadinstitute.org/gatk/
12:29:25.301 INFO FlagStat - Executing as naika@DESKTOP-TMKR2LT on Linux v4.19.128-microsoft-standard amd64
12:29:25.302 INFO FlagStat - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_275-8u275-b01-0ubuntu1~20.04-b01
12:29:25.302 INFO FlagStat - Start Date/Time: February 19, 2021 12:29:24 PM JST
12:29:25.302 INFO FlagStat - ------------------------------------------------------------
12:29:25.302 INFO FlagStat - ------------------------------------------------------------
12:29:25.317 INFO FlagStat - HTSJDK Version: 2.23.0
12:29:25.317 INFO FlagStat - Picard Version: 2.23.3
12:29:25.317 INFO FlagStat - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:29:25.317 INFO FlagStat - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:29:25.317 INFO FlagStat - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:29:25.317 INFO FlagStat - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:29:25.317 INFO FlagStat - Deflater: IntelDeflater
12:29:25.317 INFO FlagStat - Inflater: IntelInflater
12:29:25.317 INFO FlagStat - GCS max retries/reopens: 20
12:29:25.317 INFO FlagStat - Requester pays: disabled
12:29:25.317 INFO FlagStat - Initializing engine
12:29:25.928 INFO FlagStat - Done initializing engine
12:29:25.928 INFO ProgressMeter - Starting traversal
12:29:25.928 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
13:17:01.181 INFO FlagStat - 0 read(s) filtered by: WellformedReadFilter
874579472 read(s) filtered by: ReadGroupReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
874579472 total reads filtered
13:17:01.182 INFO ProgressMeter - unmapped 47.6 0 0.0
13:17:01.182 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 47.6 minutes.
13:17:01.182 INFO FlagStat - Shutting down engine
[February 19, 2021 1:17:01 PM JST] org.broadinstitute.hellbender.tools.FlagStat done. Elapsed time: 47.61 minutes.
Runtime.totalMemory()=715653120
Tool returned:
0 in total
0 QC failure
0 duplicates
0 mapped (�%)
0 paired in sequencing
0 read1
0 read2
0 properly paired (�%)
0 with itself and mate mapped
0 singletons (�%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)I will also begin a new BWA-MEM alignment again just to be safe, with ID and PU sharing the same input for my read group.
Robinn
-
Hi Robinn,
It looks like again all of your reads are filtered out because they do not match the read group you filtered for.
Once the ReadGroupReadFilter is applied, 0 reads remain.
In our documentation about BAM files, there is a section on viewing the read group on a specific line of your BAM [More about Read Groups]. Could you look at your BAM file and see what read group is there?
Best,
Genevieve
-
Maybe someone will come in handy ...
samtools reheader only changes the header.
You needed to compare the results of the commands:
samtools view -H Kuu_sorted_dedup.bam | grep '^@RG'
samtools view Kuu_sorted_dedup.bam | grep -m1 'RG:.:.'After, if necessary, use picard AddOrReplaceReadGroups
java -jar picard.jar AddOrReplaceReadGroups \
I=Kuu_sorted_dedup.bam \
O=Kuu_sorted_dedup_newRG.bam \
RGID=${ID} \ # that is, RGID will be RG:Z:${ID} !!!
RGLB=${LB} \
RGPL=${PL} \
RGPU=${PU} \ # does not work without RGPU !!!
RGSM=${SM} -
... and WSL is not recommended for production. Results may differ from real Linux (tested in WSL1).
-
Thank you for posting your solution to this post Alex Falconer!
Please sign in to leave a comment.
44 comments