Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


MarkDuplicatesSpark substantially slower than original

  • Chris Miller

    I've gone through the options several different ways, giving as many as 16 cores and 256 GB of RAM to the Spark process, and I'm still seeing much slower results than when running the non-parallel MarkDuplicates. Is there anything obviously wrong with my command here?

    /gatk/gatk MarkDuplicatesSpark -I merged.bam -O markdup_spark_local16_256.bam --spark-master local[16] --QUIET true --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'

    I do see over 11,000 of these lines in the log, which leads me to suspect that it's doing a ton of disk reads/writes, and that could obviously be a chokepoint.  

    14:14:04.359 INFO  FileOutputCommitter - Saved output of task 'attempt_20201015043238_0037_r_007720_0' to file:/scratch1/fs1/sparktest/WGS_Norm2/markdup_spark_local8.bam.parts/_temporary/0/task_20201015043238_0037_r_007720

    Are there settings I'm not considering here (maybe --bam-partition-size)? Sadly, the local Spark tutorial is empty (https://gatk.broadinstitute.org/hc/en-us/articles/360035889871), so I'm having trouble figuring out which options are applicable to a local run, as opposed to a full-fledged Spark cluster.
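    For example, I was tempted to try something along these lines, though the partition size and tmp dir below are just guesses on my part, not documented recommendations:

    /gatk/gatk MarkDuplicatesSpark -I merged.bam -O markdup_spark_tuned.bam --spark-master local[16] --bam-partition-size 134217728 --tmp-dir /scratch1/fs1/sparktest/tmp --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'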

  • Genevieve Brandt

    Hi Chris Miller, I followed up with my team and we don't think this should be occurring. Here are some follow-up questions to help determine where the issue may be:

    • How many contigs do you have? Are you working with human data?
    • How large is the BAM file?
    • Spark may not be using all of the available memory properly. Could you try the same tests while specifying the Java memory allocation with -Xmx (gatk --java-options "-Xmx4G" [program arguments])? See the sketch below.
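    For example, applied to your command above it might look something like this (the 200G heap size is only an illustration; size it to what your node actually has free):

    gatk --java-options "-Xmx200G" MarkDuplicatesSpark -I merged.bam -O markdup_spark_local16_256.bam --spark-master local[16] --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'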

  • Chris Miller

    This test was being run on a human exome BAM file, and I also tried a ~30x WGS BAM with similar results. I did try the direct memory allocation you suggested and saw no substantial difference. I suspect I/O is the limiting resource, as the cluster in question occasionally has high latency. In the end, we moved on to optimizing other parts of our pipeline. Thanks for the responses and the attempts at troubleshooting!

  • Genevieve Brandt

    Thanks for the update, Chris Miller! For the cluster you were using, was the storage a network drive or local disk?
