Optimizing Mutect2 threads with a scatter-gathering approach
While digging into a custom scatter-gather multiprocessing implementation for Mutect2, I noticed (via htop) that each Mutect2 call had 15 child processes associated with it. Most of them were suspended, but I wondered whether these extra processes (threads?) are necessary or whether they are hurting my overall performance.
This article (https://gatk.broadinstitute.org/hc/en-us/articles/360035532372-Java-is-using-too-many-resources-threads-memory-or-CPU-) made me think several of these could just be garbage collecting threads that I could remove via java-options.
I then came across this Stack Exchange thread (https://bioinformatics.stackexchange.com/questions/4608/increase-number-of-threads-for-gatk-4-0-haplotypecaller) discussing the --native-pair-hmm-threads parameter (which exists in Mutect2 with a default value of 4).
I didn't see either of these parameters being tweaked in the GATK GitHub workflow for Mutect2 (https://github.com/gatk-workflows/gatk4-somatic-snvs-indels/blob/master/mutect2.wdl), but it seems reasonable to set --native-pair-hmm-threads to 1 when parallelizing Mutect2 by scatter-gather. Or, if I only scatter to a fraction of my total processors, could I use this argument to get some HMM threading on the remaining processors?
As for the garbage-collection threads, I'm not sure how much they matter in the grand scheme of things, since my overall RAM usage is fine. I'm also wondering whether there are other threading parameters or considerations I'm missing for Mutect2 scatter-gather performance.
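To make the question concrete, here is a minimal Python sketch of the setup being described. It is not the official workflow. It assumes one Mutect2 process per interval, an even split of cores across concurrent jobs, and that capping garbage-collection threads with the JVM flag -XX:ParallelGCThreads (passed through gatk's --java-options) is the kind of tweak the linked article has in mind. The function names, file paths, and the tumor-only invocation are hypothetical, and gathering the per-interval outputs is left out.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def hmm_threads_per_job(total_cores, n_jobs):
    """Split cores evenly across concurrent Mutect2 jobs (at least 1 each)."""
    return max(1, total_cores // n_jobs)


def build_mutect2_cmd(ref, bam, interval, out_vcf, hmm_threads=1, gc_threads=1):
    """Build one per-interval Mutect2 command as an argument list.

    gc_threads is an assumption: it caps the JVM's parallel GC threads
    so each scattered job does not spawn one per visible core.
    """
    return [
        "gatk",
        "--java-options", f"-XX:ParallelGCThreads={gc_threads}",
        "Mutect2",
        "-R", ref,
        "-I", bam,
        "-L", interval,
        "-O", out_vcf,
        "--native-pair-hmm-threads", str(hmm_threads),
    ]


def scatter(ref, bam, intervals, out_dir, n_jobs, total_cores=None):
    """Run one Mutect2 process per interval, n_jobs at a time.

    Returns the list of per-interval output VCF paths.
    """
    total_cores = total_cores or os.cpu_count() or 1
    hmm = hmm_threads_per_job(total_cores, n_jobs)
    cmds = [
        build_mutect2_cmd(ref, bam, iv, os.path.join(out_dir, f"{iv}.vcf.gz"), hmm)
        for iv in intervals
    ]
    # Threads here only wait on child processes; the work happens in the JVMs.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(lambda c: subprocess.run(c, check=True), cmds))
    return [c[c.index("-O") + 1] for c in cmds]
```

With 16 cores, scattering to 16 jobs gives each job a single HMM thread, while scattering to 4 jobs gives each job 4. Whether the second layout is actually faster is exactly the open question here and would need to be measured.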
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
Hello, I use scatter-gather mode and run Mutect2 in parallel on separate chromosomes, but the results I get this way are not consistent with the results I get from running without it. How can I ensure consistent results when using scatter-gather mode?
See the answer on the GitHub page, but essentially: the inconsistencies will not exist in the next version of Mutect, which is coming out soon.
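For anyone who wants to quantify the discrepancy in the meantime, a rough sketch like the following lists the calls that appear in only one of the two runs. It compares records by chromosome, position, REF, and ALT only, and says nothing about why the runs differ. The function names are hypothetical.

```python
import gzip


def parse_calls(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    calls = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        calls.add((chrom, int(pos), ref, alt))
    return calls


def load_calls(vcf_path):
    """Read a plain or gzip-compressed VCF into a set of call keys."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        return parse_calls(fh)


def diff_calls(scattered, unscattered):
    """Return (only in scattered run, only in unscattered run), sorted."""
    return sorted(scattered - unscattered), sorted(unscattered - scattered)
```

Calling diff_calls(load_calls(a), load_calls(b)) on the merged scatter-gather VCF and the single-run VCF gives two lists that can be inspected by hand.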