Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

MarkDuplicates returns error with Multilane samples?

7 comments

  • danilovkiri

    vctrymao

    The exception explicitly states that the BAM files should be sorted by queryname. Can you confirm that the input BAM files are queryname sorted? If not, `samtools sort -n` will help.

  • vctrymao

    I tried using Picard ReorderSam but that didn't help. Is samtools sort different?

  • danilovkiri

    Have you read the documentation? https://gatk.broadinstitute.org/hc/en-us/articles/360037426651-ReorderSam-Picard-

    Googling a tool prior to its usage is a great way to understand what it actually does. The main point is that it DOES NOT sort. I'm not going to compare these tools myself, it's up to you. Judging by the very first sentence in the ReorderSam documentation, it is completely different from samtools sort.

  • vctrymao

    Thank you. I tried doing `samtools sort -n ...` but now it seems I cannot create an index for the BAM file using samtools, since it gives me errors like "Unsorted positions on sequence #33: 49090746 followed by 49090586".

  • danilovkiri

    An index can be created on coordinate-sorted files only. MarkDuplicates does not require a BAI index for an input BAM file.

  • vctrymao

    I retried this using `samtools sort -n`, but I am still getting the same error...

  • Louis Bergelson

    There's a longstanding (and frankly stupid) issue with htsjdk reading samtools queryname-sorted files. Htslib and htsjdk disagree about what it means to queryname sort a file. Htslib uses a numerically aware sort (i.e. a2 and a3 sort before a21), while htsjdk uses simple lexicographical sorting. Htsjdk checks that things are sorted the way it likes and rejects samtools queryname-sorted files. I'm not sure why you're getting your original error if you sorted by queryname using SortSam, but `samtools sort -n` is definitely going to cause problems. I might try updating to the newest version of GATK (gatk 4.1.8.0) and using MarkDuplicatesSpark, which should be able to consume an unsorted SAM and sort it for you while marking duplicates. You could also look at the best practice WDLs, which *should* just work assuming your data isn't something very different from Illumina human data.

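To make the ordering disagreement from the last comment concrete, here is a minimal Python sketch. It is not the htslib or htsjdk code: `natural_key` and `is_lexicographically_sorted` are hypothetical helpers that only approximate the two behaviours described in the thread (a numerically aware sort versus a plain lexicographical one), and the read names are the toy examples a2, a3 and a21 from the comment.

```python
import re


def natural_key(name):
    """Approximate a numerically aware sort key (assumed to resemble
    what the comment says samtools sort -n does)."""
    # re.split with a capturing group yields alternating text and digit
    # runs, always starting with a (possibly empty) text run, so odd
    # indices are the digit runs and can be compared as integers.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def is_lexicographically_sorted(names):
    """Approximate the kind of validation described for htsjdk:
    every name must be <= the next one under plain string comparison."""
    return all(a <= b for a, b in zip(names, names[1:]))


names = ["a21", "a2", "a3"]

natural_order = sorted(names, key=natural_key)   # numerically aware
lexicographic_order = sorted(names)              # plain string order

print(natural_order)        # ['a2', 'a3', 'a21']
print(lexicographic_order)  # ['a2', 'a21', 'a3']

# The numerically aware order fails a lexicographical check,
# which mirrors why a file sorted one way is rejected by the other tool.
print(is_lexicographically_sorted(natural_order))        # False
print(is_lexicographically_sorted(lexicographic_order))  # True
```

Under these assumptions, a file whose names are in the numerically aware order looks "unsorted" to a lexicographical validator, even though both tools call the result queryname sorted.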
