ApplyBQSR to unmapped reads for non-model organism does not work with '-L unmapped' flag
- GATK version used:
4.1.2.0 and 4.1.4.0
- Exact command used:
This does NOT work:
bam_in=whole_genome_sorted.bam
gatk ApplyBQSR --java-options "-Xmx4G -Dsamjdk.compression_level=5 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" -R $ref -I $bam_in --bqsr-recal-file $table -L unmapped -O $bam_out -OBI false
This DOES work:
bam_in=f12_unmapped.bam
gatk ApplyBQSR --java-options "-Xmx4G -Dsamjdk.compression_level=5 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" -R $ref -I $bam_in --bqsr-recal-file $table -O $bam_out -OBI false
- Entire error log:
08:49:22.056 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/apps/gatk/4.1.4.0/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Sep 03, 2020 8:49:22 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
08:49:22.223 INFO ApplyBQSR - ------------------------------------------------------------
08:49:22.224 INFO ApplyBQSR - The Genome Analysis Toolkit (GATK) v4.1.4.0
08:49:22.224 INFO ApplyBQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
08:49:22.224 INFO ApplyBQSR - Executing as cew562@gadi-login-06.gadi.nci.org.au on Linux v4.18.0-193.6.3.el8.nci.x86_64 amd64
08:49:22.224 INFO ApplyBQSR - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_252-b09
08:49:22.224 INFO ApplyBQSR - Start Date/Time: September 3, 2020 8:49:22 AM AEST
08:49:22.224 INFO ApplyBQSR - ------------------------------------------------------------
08:49:22.224 INFO ApplyBQSR - ------------------------------------------------------------
08:49:22.224 INFO ApplyBQSR - HTSJDK Version: 2.20.3
08:49:22.224 INFO ApplyBQSR - Picard Version: 2.21.1
08:49:22.225 INFO ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 5
08:49:22.225 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:49:22.225 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:49:22.225 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:49:22.225 INFO ApplyBQSR - Deflater: IntelDeflater
08:49:22.225 INFO ApplyBQSR - Inflater: IntelInflater
08:49:22.225 INFO ApplyBQSR - GCS max retries/reopens: 20
08:49:22.225 INFO ApplyBQSR - Requester pays: disabled
08:49:22.225 INFO ApplyBQSR - Initializing engine
08:49:22.439 INFO IntervalArgumentCollection - Processing 1 bp from intervals
08:49:22.441 INFO ApplyBQSR - Done initializing engine
08:49:22.470 INFO ProgressMeter - Starting traversal
08:49:22.471 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
08:49:23.024 INFO ApplyBQSR - Shutting down engine
[September 3, 2020 8:49:23 AM AEST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2107113472
java.lang.IllegalArgumentException: Reference name for '187393' not found in sequence dictionary.
at htsjdk.samtools.SAMRecord.resolveNameFromIndex(SAMRecord.java:573)
at htsjdk.samtools.SAMRecord.setReferenceIndex(SAMRecord.java:426)
at htsjdk.samtools.BAMRecord.<init>(BAMRecord.java:110)
at htsjdk.samtools.DefaultSAMRecordFactory.createBAMRecord(DefaultSAMRecordFactory.java:42)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:283)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:866)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:840)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.<init>(BAMFileReader.java:820)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.<init>(BAMFileReader.java:808)
at htsjdk.samtools.BAMFileReader$BAMFileIndexUnmappedIterator.<init>(BAMFileReader.java:1091)
at htsjdk.samtools.BAMFileReader$BAMFileIndexUnmappedIterator.<init>(BAMFileReader.java:1090)
at htsjdk.samtools.BAMFileReader.queryUnmapped(BAMFileReader.java:678)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryUnmapped(SamReader.java:543)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:129)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:404)
at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:330)
at java.lang.Iterable.spliterator(Iterable.java:101)
at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1099)
at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:377)
at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:93)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Using GATK jar /apps/gatk/4.1.4.0/gatk-package-4.1.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4G -Dsamjdk.compression_level=5 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /apps/gatk/4.1.4.0/gatk-package-4.1.4.0-local.jar ApplyBQSR -R ./Reference/GCA_902635505.1_mSarHar1.11_genomic.fna -I ./Dedup_sort/Esquivel.coordSorted.dedup.bam --bqsr-recal-file ./BQSR_recal_tables/Round1/Esquivel.recal_data.table -L unmapped -O unmapped_recal.bam -OBI false
The '-L unmapped' command works perfectly on human, but not on the Tasmanian Devil. This genome has very large autosomes so the BAM is CSI indexed. Recalibration using the '-L interval' flag works perfectly on all Tas Devil chromosomes and unplaced contigs. It is just the 'unmapped' that does not work. The "Reference name for '187393' not found in sequence dictionary" error is identical in all repeated test runs (unlike some other posts I have seen where the number differs between runs). This value 187393 does not grep from the .dict file. Extracting out the f12 unmapped reads to a new BAM with samtools and running ApplyBQSR without the '-L unmapped' flag results in no error and a recalibrated unmapped BAM file. While this workarond is easy to implement, I would very much like to know the source of the error when using '-L unmapped'.
Many thanks,
Cali
-
Hello cali, can you look at your BAM file header and the FASTA dict file and confirm that the @SQ lines in your bam file match your dict file?
-
Thanks for your reply. Only the header differs:
$ sdiff -s bam_headers.txt dict_first3cols.txt
@HD VN:1.3 SO:coordinate | @HD VN:1.4 SO:unsortedI would be surprised if this was the cause - particularly because the interval flag works fine for the contigs.
Many thanks,
Cali
-
cali could you validate your BAM file following these instructions?
-
ValidateSamFile in summary mode shows the following:
Error Type Count
ERROR:INVALID_INSERT_SIZE 134030
WARNING:REF_SEQ_TOO_LONG_FOR_BAI 1And with "IGNORE_WARNINGS" and verbose mode shows the first 100 events of "Insert size out of range" then exits, eg
ERROR: Record 4791, Read name ST-E00180:231:H7CW3ALXX:2:2124:6705:68482, Insert size out of range
Investigating the first record shows that the error is thrown by GATK disliking the very large chomosomes of this species (chr 1 is 716 Mb):
ST-E00180:231:H7CW3ALXX:2:2124:6705:68482 435 LR735554.1 10444 5 15H40M95H = 652183949 652173615 GGGGGTGGGGGGAAACAGGGAAAAAATTGGAAAAAAAGGT ))7)-))))7----))7----A7-----7----7F-77AA NM:i:2 MD:Z:20G11C7 MC:Z:149M AS:i:30 XS:i:27 RG:Z:H7CW3ALXX.2_659BigJo_1 SA:Z:LR735554.1,652183849,+,21M1D60M69S,37,2; XA:Z:LR735560.1,+1334233,103S32M15S,1;
Clearly this is not a BAM format error as an inferred insert size of this scale is biologically possible for this species (though probably a mapping error in this case). Is it safe to ignore these warnings? Can this be fixed in future GATK releases?
Many thanks,
Cali
-
Hi cali, I will look at this with my team and let you know when I have more information.
-
Hi cali,
The issue with ValidateSamFile does not seem to be related to your BQSR issue. PICARD has been updated since the version you are using to allow for a larger insert size. The INVALID_INSERT_SIZE error should disappear. Our current GATK version is 4.1.8.1.
For the above issue with BQSR, I have some follow up questions:
- What tool did you use to produce these bams?
- If you use samtools to extract the unmapped only reads, do you get a warning? It may be something like: [W::sam_parse1] urecognized reference name; treated as unmapped
- How many contigs are in the reference?
-
- The BAMs were aligned with BWA-mem v 0.7.17, sorted and indexed (CSI) with SAMtools v 1.10, BAMs merged to per-sample BAM with SAMbamba v 0.7.1, duplicate reads marked with SAMblaster v 0.1.24.
- There are no errors when the unmapped reads are extracted with samtools. Extracting out the f12 unmapped reads to a new BAM with samtools and running ApplyBQSR without the '-L unmapped' flag results in no error and a recalibrated unmapped BAM file.
- There are 105 contigs in the reference
- I just tested the recalibration using '-L unmapped' with 4.1.8.1 and the same error was produced as per 4.1.2.0 and 4.1.4.0.
Kind regards,
Cali
-
Hi cali, could you check these things?
- Look for a read in your BAM file where the contig name is '187393'. You can use PrintReads -L 187393.
- The sequence dictionary or BAM index files may be corrupt. Re-create the sequence dictionary with CreateSequenceDictionary. Re-create the BAM index with IndexFeatureFile.
Please sign in to leave a comment.
8 comments