Mutect2 in multisample mode slows at HLA Loci in hg38 alt-aware alignments
Background: I'm interested in inter- and intra- tumoral diversity with time. Therefore I have about 50 samples per patient. To end up with a complete matrix of loci and their mutational calls for each sample, I'm using the multisample functionality of Mutect2. I'm using the scatter-gather approach to parallelize Mutect2 over 500 genomic intervals on our institution's cluster computer. Each node is using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation, however I've set the number of threads requested to 1 while I'm tweaking the mutect2 performance parameters.
Pseudo Code:
gatk Mutect2
-R hg38
-L 1..500
-I normal -I tumor1 -I tumor2 ... -I tumor50
--normal-sample normal
--germline-resource gnomadAf
--panel-of-normals pon
-O patient.vcf
--native-pair-hmm-threads 1
--java-options "-Djava.io.tmpdir=tmpDir -Xmx4G -XX:ParallelGCThreads=1"
This runs just fine except when it reaches the very last interval which includes all the HLA and decoy loci. At this point the compute time can be as long as a day, whereas the other intervals complete in a matter of hours.
What I've Found So Far:
I found this legacy forum post. A user named davidben wrote the following, albeit in regard to a slightly different question:
A back-up plan that might help is to introduce some very conservative downsampling: `—downsampling-stride 20 —max-reads-per-alignment-start 6 —max-suspicious-reads-per-alignment-start 6`. This will only truncate pathological regions of extreme depth due to mapping error.
Questions:
- Does anyone have any experience with the down sampling method described above?
- What constitutes a "suspicious read"?
- Are there any other options that I'm not considering?
Edits:
- 2023-08-28: Corrected description of last interval - it's actually the "decoy" loci and the HLA loci that are causing the hang up.
- 2023-08-28: By implementing the down sampling method above, the intervals finish within a reasonable time frame; however, these intervals seem to be consuming more memory (>4GB) causing several errors.
-
This davidben fellow's advice is solid* but you might need to scale up some values for multi-sample mode. The stride can stay at 20 but the reads per alignment start pertain to the total depth over all samples. For example, the current value will downsample to 6*20 = 120 reads starting in every 20-base window.
The combination of many samples with a highly polymophic region such as the HLA is inevitably going to push Mutect2 to its limits. I would also try to rein in the complexity of the local assembly by setting the mapping quality read filter to a higher threshold like 40 or 50. You should also experiment with the --linked-de-bruijn-graph argument.Congratulations, by the way, on posing a question where the answer is not to stick with the defaults! And please let us know how these settings go. This is uncharted territory for us.
* He's the lead developer of Mutect2.** He's also me.
Please sign in to leave a comment.
1 comment