MarkDuplicates returns error with Multilane samples?
I am using GATK 4.0.0.0. I am following this guide (https://gatk.broadinstitute.org/hc/en-us/articles/360035889471-How-should-I-pre-process-data-from-multiplexed-sequencing-and-multi-library-designs-) in order to run HaplotypeCaller on multiplexed samples.
According to the guide, I am supposed to run mapping and sorting before MarkDuplicates. I ran mapping and SortAndFixTags according to the GATK wdl (which consists of SortSam and SetNmAndUqTags), and then tried running MarkDuplicates, but I get this error. Am I doing something wrong?
Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///home/clb36/parkhome/Juan-PTA/.PreProcessing/MarkDuplicates/4772-JW-6_S125_L004.bam. Sort order is queryname. Offending records are at [A00758:88:HFCTYDSXY:4:2571:25093:6652] and [A00758:88:HFCTYDSXY:4:2244:30309:5102]
at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:213)
at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:200)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:406)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
-
The exception explicitly states that the BAM files should be sorted by queryname. Can you confirm that the input BAM files are queryname sorted? If not, `samtools sort -n` will help.
-
I tried using Picard ReorderSam but that didn't help. Is samtools sort different?
-
Have you read the doc https://gatk.broadinstitute.org/hc/en-us/articles/360037426651-ReorderSam-Picard-?
Googling a tool prior to its usage is a great way to understand what it actually does. The main point is that it DOES NOT sort. I'm not going to compare these tools myself, it's up to you. Judging by the very first sentence in the ReorderSam documentation, it is completely different from samtools sort.
-
Thank you. I tried doing `samtools sort -n ...` but now it seems I cannot create an index for the BAM file using samtools, since it gives me errors like "Unsorted positions on sequence #33: 49090746 followed by 49090586".
-
An index can be created on coordinate-sorted files only. MarkDuplicates does not require a BAI index for an input BAM file.
-
I retried this using `samtools sort -n`, but I am still getting the same error...
-
There's a longstanding (and frankly stupid) issue with htsjdk reading samtools queryname-sorted files. Htslib and htsjdk disagree about what it means to queryname sort a file. Htslib uses a numerically aware sort (i.e. a2 and a3 sort before a21), while htsjdk uses simple lexicographical sorting. Htsjdk checks that records are sorted the way it likes and rejects samtools queryname-sorted files. I'm not sure why you're getting your original error if you sorted by queryname using SortSam, but `samtools sort -n` is definitely going to cause problems. I might try updating to the newest version of gatk (gatk 4.1.8.0) and using MarkDuplicatesSpark, which should be able to consume an unsorted SAM and sort it for you while marking duplicates. You could also look at the best practice wdls, which *should* just work assuming your data isn't something very different from Illumina human data.
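To make the ordering disagreement concrete, here is a minimal Python sketch. It is not the actual htslib or htsjdk code: `natural_key` is a hypothetical, simplified stand-in for a numerically aware comparison, plain string comparison stands in for the lexicographical one, and the read names in `reads` are made up for illustration.

```python
import re


def natural_key(name):
    """Simplified numerically aware key: runs of digits compare as integers.

    Only an approximation of a numerically aware sort, for illustration.
    """
    parts = re.findall(r"[0-9]+|[^0-9]+", name)
    # Tag each part so integers are only ever compared with integers.
    return [(0, int(p)) if p.isdigit() else (1, p) for p in parts]


def is_sorted(names, key=None):
    """Return True if names are in non-decreasing order under the given key."""
    keys = [key(n) if key else n for n in names]
    return all(a <= b for a, b in zip(keys, keys[1:]))


simple = ["a21", "a3", "a2"]
lexicographic = sorted(simple)            # plain string comparison
natural = sorted(simple, key=natural_key)  # numerically aware comparison
print(lexicographic)
print(natural)

# Hypothetical read names, listed in numerically aware order (9000 before 10000).
reads = [
    "A00758:88:HFCTYDSXY:4:1101:9000:100",
    "A00758:88:HFCTYDSXY:4:1101:10000:100",
]
natural_ok = is_sorted(reads, key=natural_key)
lexicographic_ok = is_sorted(reads)
print(natural_ok, lexicographic_ok)
```

A file ordered the first way can pass a numerically aware check and still fail a lexicographical one, which is the kind of mismatch that would trigger an "Alignments added out of order" check.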