Some MarkDuplicates chunks take a very long time to run
This run was done using GATK 4.1.6.0, and the input BAM had ~600M reads. I'm curious why some chunks of MarkDuplicates take so long to process: most chunks take 2-4s, while others can take up to 5280s (as shown in the image below).
I checked the duplicate read ratio of this problem region, but it doesn't differ much from the adjacent chunks that were processed in 3-4s, so the total number of duplicates doesn't seem to be the cause.
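For reference, this is roughly how a per-region duplicate ratio can be checked with samtools; the coordinates below are just placeholders, not the actual slow chunk:

# reads flagged as duplicates (0x400) in the region, followed by all reads in the region
samtools view -c -f 0x400 input.bam chr1:10000000-12000000
samtools view -c input.bam chr1:10000000-12000000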
What is going on here that causes such a drastic leap in processing time, and is there anything I can do to make this run faster?
Here is the whole MarkDuplicates command:
gatk MarkDuplicates --INPUT input.bam --METRICS_FILE output.metrics --REMOVE_DUPLICATES false --ASSUME_SORT_ORDER coordinate --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
--VALIDATION_STRINGENCY SILENT --CREATE_INDEX true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false \
--REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES \
--PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 \
--VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false \
--version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Further into the pipeline, I'm noticing that Mutect also processes this region (in a scatter split) quite slowly, 1.5-3x slower than the other scatters. Could there be some feature of the data in this region that makes it cumbersome for these GATK tools to process?
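In case it helps to characterize the region, a quick coverage check along these lines would show whether there is an unusual pileup there (again, the interval is a placeholder):

# mean depth across the suspect interval; an extreme pileup would stand out immediately
samtools depth -a -r chr1:10000000-12000000 input.bam | awk '{sum+=$3} END {if (NR>0) print sum/NR}'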
-
Hi there! The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
That being said, I'll offer a few quick thoughts, though they don't answer your main question. Have you tried MarkDuplicatesSpark to increase speed? It typically runs about 15% faster than MarkDuplicates plus SortSam over the same data at 2 cores, and continues to scale roughly linearly as you add cores, up to around 16.
This thread offers thoughts on settings.
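For reference, a minimal local MarkDuplicatesSpark invocation looks roughly like the example below (the output name and core count are placeholders; see the tool documentation for the full argument list). Note that because your input is coordinate-sorted, the Spark tool will spend some extra time queryname-sorting the reads internally before marking duplicates.

gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M output.metrics \
    --conf 'spark.executor.cores=8'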