Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

MarkDuplicatesSpark running out of memory



  • Genevieve Brandt (she/her)

    Hi Matt Snyder,

    Thanks for writing in to the GATK forum! We took a look and have some ideas for troubleshooting this issue:

    1. We don't normally run MarkDuplicatesSpark on multiple BAMs. Could you try running MergeBamAlignment on your alignment output and using only one BAM as input to MarkDuplicatesSpark?
    2. Make sure you are specifying Spark local options, in particular the number of cores. We would recommend decreasing the cores, because MarkDuplicatesSpark doesn't see much speedup with more cores in local mode anyway.
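The two suggestions above can be sketched as commands. This is a hedged sketch, not a verified pipeline: the file names, reference path, heap size, and core count are placeholders you would adapt to your own data (note that MergeBamAlignment pairs one aligned BAM with one unmapped BAM, so per-lane BAMs may need merging beforehand).

```shell
# 1. Combine the aligned output with the unmapped BAM so that
#    MarkDuplicatesSpark receives a single input file.
#    (Paths below are placeholders.)
gatk MergeBamAlignment \
    --ALIGNED_BAM aligned.bam \
    --UNMAPPED_BAM unmapped.bam \
    --REFERENCE_SEQUENCE reference.fasta \
    --OUTPUT merged.bam

# 2. Run MarkDuplicatesSpark with an explicit Java heap size and a
#    capped number of local cores. Spark-specific arguments go after
#    the lone "--" separator; "local[8]" limits Spark to 8 threads.
gatk --java-options "-Xmx100g" MarkDuplicatesSpark \
    -I merged.bam \
    -O marked_duplicates.bam \
    -M duplicate_metrics.txt \
    -- --spark-runner LOCAL --spark-master 'local[8]'
```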

    Let me know how this goes for you.



  • Matt Snyder

    Thanks, Genevieve Brandt (she/her)!

    1. I will try merging the BAM files first. This also makes sense because then I only have to sort one BAM file instead of dozens.
    2. Do you have a recommended number of threads? I actually only requested 16 cores on the cloud worker, but since the large amount of RAM I requested requires a bigger instance, I got 48. I'll try specifying 16 in the call to MarkDuplicatesSpark.

    I'll give this a shot and let you know how it goes.


  • Genevieve Brandt (she/her)

    Matt Snyder, 4-8 would be the most efficient, but if you have more cores you could use up to 16.
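In local mode, that core count goes into the Spark master URL rather than a dedicated thread flag. A minimal sketch, assuming placeholder file names and the 8-thread middle of the range suggested above:

```shell
# Cap MarkDuplicatesSpark at 8 local worker threads: "local[8]" is the
# Spark master URL for a local run with 8 threads. Spark arguments
# follow the lone "--" separator.
gatk MarkDuplicatesSpark \
    -I merged.bam \
    -O marked_duplicates.bam \
    -- --spark-master 'local[8]'
```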

