Using GATK 22.214.171.124 (via docker image `broadinstitute/gatk:126.96.36.199`), I'm attempting to speed up duplicate marking by running MarkDuplicatesSpark locally. This is on a fast SSD storage tier on our cluster, but I'm seeing better performance from the base MarkDuplicates command on a single core than from MarkDuplicatesSpark on 8 cores. I'm wondering if there are any settings I'm missing here.
The input data is a cell line exome BAM (HCC1395) that has been aligned and name-sorted, and is staged on the same fast SSD tier where I'm specifying temp space.
"Base" command - run with a single core, 32G of RAM - completes in 59 minutes:
/gatk/gatk MarkDuplicates -I namesorted.bam -O /dev/stdout -METRICS_FILE markdup_spark.metrics -ASSUME_SORT_ORDER queryname -QUIET true -COMPRESSION_LEVEL 0 -VALIDATION_STRINGENCY LENIENT | /gatk/gatk SortSam -I /dev/stdin -O markdup_base_sorted.bam --SORT_ORDER coordinate
Spark command - run with 8 cores and 64G of RAM - completes in 120 minutes:
/gatk/gatk MarkDuplicatesSpark -I namesorted.bam -O markdup_spark.8.bam --conf 'spark.executor.cores=8' --QUIET true --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'
Are there additional params I could/should be providing? Happy to share logs or data if that helps - thanks!
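For example, would explicitly setting the Spark master help? The `spark.executor.cores` property may not apply to a local (non-cluster) run, so I was thinking of something like the following - an untested sketch, and I'm assuming `--spark-master` and `--tmp-dir` are the right knobs here:

```shell
# Untested sketch: pin local-mode parallelism to 8 threads via --spark-master
# and point both GATK temp space and Spark's shuffle dir at the SSD tier.
# (Assumes these flags behave the same in this GATK build -- please correct me if not.)
/gatk/gatk MarkDuplicatesSpark \
    -I namesorted.bam \
    -O markdup_spark.8.bam \
    --QUIET true \
    --spark-master 'local[8]' \
    --tmp-dir /scratch1/fs1/sparktest/tmp \
    --conf 'spark.local.dir=/scratch1/fs1/sparktest/tmp'
```
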