MarkDuplicatesSpark substantially slower than original
Using GATK 4.1.8.1 (via docker image `broadinstitute/gatk:4.1.8.1`), I'm attempting to speed up marking dups by using MarkDuplicatesSpark locally. This is using a fast SSD storage tier on our cluster, but I'm seeing better performance from the base MarkDuplicates command on a single core than I am from MarkDuplicatesSpark on 8 cores. I'm wondering if there are any settings that I'm missing here.
The input data is a cell line exome BAM (HCC1395) that has been aligned and name-sorted, and is staged on the same fast SSD tier where I'm specifying temp space.
"Base" command - run with a single core, 32G of RAM - completes in 59 minutes:
/gatk/gatk MarkDuplicates -I namesorted.bam -O /dev/stdout -METRICS_FILE markdup_spark.metrics -ASSUME_SORT_ORDER queryname -QUIET true -COMPRESSION_LEVEL 0 -VALIDATION_STRINGENCY LENIENT | /gatk/gatk SortSam -I /dev/stdin -O markdup_base_sorted.bam --SORT_ORDER coordinate
Spark command - run with 8 cores and 64G of RAM - completes in 120 minutes:
/gatk/gatk MarkDuplicatesSpark -I namesorted.bam -O markdup_spark.8.bam --conf 'spark.executor.cores=8' --QUIET true --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'
Are there additional params I could/should be providing? Happy to share logs or data if that helps - thanks!
-
Hi Chris Miller, please see these resources for more information:
-
I've gone through the options several different ways, giving as many as 16 cores and 256GB of RAM to the Spark process, and I'm still seeing much slower results than when running the non-parallel MarkDuplicates. Is there anything obviously wrong with my command here?
/gatk/gatk MarkDuplicatesSpark -I merged.bam -O markdup_spark_local16_256.bam --spark-master local[16] --QUIET true --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'
I do see over 11,000 of these lines in the log, which leads me to suspect that it's doing a ton of disk reads/writes, and that could obviously be a chokepoint.
14:14:04.359 INFO FileOutputCommitter - Saved output of task 'attempt_20201015043238_0037_r_007720_0' to file:/scratch1/fs1/sparktest/WGS_Norm2/markdup_spark_local8.bam.parts/_temporary/0/task_20201015043238_0037_r_007720
Are there settings I'm not considering here (maybe --bam-partition-size)? Sadly, the local Spark tutorial is empty (https://gatk.broadinstitute.org/hc/en-us/articles/360035889871), so I'm having trouble figuring out which options are applicable to a local run, as opposed to a full-fledged Spark cluster.
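For what it's worth, if --bam-partition-size is the right knob, I'd guess the invocation would look something like the sketch below; the 134217728-byte (128 MB) partition size is just a placeholder I picked for illustration, not a value I've confirmed is sensible:
/gatk/gatk MarkDuplicatesSpark -I merged.bam -O markdup_spark_local16_256.bam --spark-master local[16] --bam-partition-size 134217728 --QUIET true --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'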
-
Hi Chris Miller, I followed up with my team and we don't think this should be occurring. Here are some follow-up questions to help determine where the issue may be:
- How many contigs do you have? Are you working with human data?
- How large is the BAM file?
- Spark may not be using all of the available memory properly. Could you try the same tests while specifying the Java memory allocation with -Xmx (i.e. gatk --java-options "-Xmx4G" [program arguments])? A sketch of what the full command might look like is below this list.
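To make that last suggestion concrete, one way the full invocation might look is sketched below; the 32G heap, the 8-thread local master, and the file names are only placeholders carried over from the earlier commands, not recommended values:
/gatk/gatk --java-options "-Xmx32G" MarkDuplicatesSpark -I namesorted.bam -O markdup_spark.8.bam --spark-master local[8] --QUIET true --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'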
-
This test was being run on a human exome bam file, and I also tried a ~30x WGS bam with similar results. I did try the direct memory allocation specification that you suggested and saw no substantial difference. I suspect I/O is the limiting resource, as the cluster in question occasionally has high latency. In the end, we moved on to optimizing other parts of our pipeline. Thanks for the responses and the attempts at troubleshooting!
-
Thanks for the update, Chris Miller! For the cluster you were using, was the storage a network drive or local disk?