Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Proper way to prepare .bam for MarkDuplicatesSpark



  • Derek Caetano-Anolles

    Hi samuel,

    Welcome to using GATK! I hope that we can help get your pipeline sorted out and functioning.

    To start with, your command appears to use `-conf` instead of the `--conf` option. I didn't see an error related to that in your log file, so I am not sure that is the issue, but just in case, I'd make sure your commands follow the tool documentation.

    Unfortunately, we do not typically run multi-sample BAMs with MarkDuplicatesSpark. This doesn't necessarily mean that the tool will never work with multi-sample BAMs, just that it would really depend on how your input is formatted.

    You could try making sure that each read in your input file has the correct read group and library information for that sample and try again. If MarkDuplicatesSpark doesn't work, try using Picard's MarkDuplicates, which works differently than the Spark version and may end up giving you the result you want. No promises though, since MarkDuplicates (like the Spark version) doesn't explicitly support multi-sample BAMs.

    That said, the fastest way to get your duplicates marked would definitely be to split your BAM into single samples and run MarkDuplicatesSpark on each of them. Since the tool expects single-sample inputs, feeding in the right kind of files should resolve the issue. Try re-generating your BAMs if you have the raw data, or (if you only have the BAM) use samtools or a comparable tool to split it based on sample/RG.

    I hope that helps!

  • Hi @Derek Caetano-Anolles,

    Thanks for the clarification! I will use single-sample BAMs from now on.

    I didn't get an out of space error when I ran the program, so the missing hyphen in `--conf` was probably a typo I made while writing the post.

    Yes, I went ahead and used MarkDuplicates, and I got it to work after I included SM in the RG tags. I will re-run it with single-sample BAMs and compare the results.

    Thanks a lot!

  • Derek Caetano-Anolles

    Great! I'm glad it worked out. Don't hesitate to come back with more questions if you have GATK-related trouble further on down the line.
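
The suggestion in the thread to check read group and library information, and the fix of adding SM to the RG tags, can be sanity-checked before running either duplicate-marking tool. The following is a minimal sketch, not part of GATK or Picard: it assumes the header text has already been dumped (for example with `samtools view -H`), and the function names and the example header are hypothetical. It only inspects `@RG` header lines; it does not verify that every read actually carries an RG tag.

```python
from collections import defaultdict


def parse_read_groups(header_text):
    """Return {read_group_id: {tag: value}} for every @RG line in a SAM header."""
    read_groups = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "ID" in fields:
            read_groups[fields["ID"]] = fields
    return read_groups


def summarize(read_groups, required=("SM", "LB")):
    """Group read group IDs by sample and list read groups missing required tags."""
    samples = defaultdict(list)
    missing = {}
    for rg_id, tags in read_groups.items():
        absent = [tag for tag in required if tag not in tags]
        if absent:
            missing[rg_id] = absent
        if "SM" in tags:
            samples[tags["SM"]].append(rg_id)
    return dict(samples), missing


# Hypothetical header for illustration only.
EXAMPLE_HEADER = "\n".join([
    "@HD\tVN:1.6\tSO:queryname",
    "@RG\tID:rg1\tSM:sampleA\tLB:libA\tPL:ILLUMINA",
    "@RG\tID:rg2\tSM:sampleB\tLB:libB\tPL:ILLUMINA",
    "@RG\tID:rg3\tPL:ILLUMINA",
])

samples, missing = summarize(parse_read_groups(EXAMPLE_HEADER))
print("samples:", samples)
print("read groups missing tags:", missing)
```

More than one key in `samples` means the header describes a multi-sample BAM, and any entry in `missing` points at a read group that lacks sample or library information.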

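
For the samtools route mentioned in the thread (splitting an existing BAM by sample/RG), one possible approach is to extract each read group with `samtools view -r`. The sketch below only builds the command lines and does not run them. The read-group-to-sample mapping, the file names, and the `split_commands` helper are all hypothetical. It is also an assumption worth checking that the split files' headers look the way the downstream tool expects, since this sketch does nothing to edit the `@RG` lines.

```python
def split_commands(rg_to_sample, input_bam):
    """Build one `samtools view` command per read group, grouped by sample.

    Returns {sample: [argument_list, ...]}; nothing is executed here.
    """
    commands = {}
    for rg_id, sample in sorted(rg_to_sample.items()):
        output_bam = f"{sample}.{rg_id}.bam"  # hypothetical naming scheme
        # -b: write BAM, -r: keep only this read group, -o: output file
        cmd = ["samtools", "view", "-b", "-r", rg_id, "-o", output_bam, input_bam]
        commands.setdefault(sample, []).append(cmd)
    return commands


# Hypothetical mapping, e.g. taken from the @RG lines of the input header.
RG_TO_SAMPLE = {"rg1": "sampleA", "rg2": "sampleB", "rg3": "sampleA"}

plan = split_commands(RG_TO_SAMPLE, "multi_sample.bam")
for sample, cmds in plan.items():
    for cmd in cmds:
        print(sample, "->", " ".join(cmd))
```

Each argument list could be passed to `subprocess.run` once the plan looks right. A sample with several read groups ends up with several files here; merging them back into one file per sample is left out of this sketch.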
