Background: I'm interested in inter- and intra-tumoral diversity over time, so I have about 50 samples per patient. To produce a complete matrix of loci and their mutational calls for every sample, I'm using Mutect2's multi-sample mode. I parallelize Mutect2 with a scatter-gather approach over 500 genomic intervals on our institution's compute cluster. Each job uses the OpenMP multi-threaded AVX-accelerated native PairHMM implementation, but I've set the requested thread count to 1 while I tune Mutect2's performance parameters.
```
-I normal -I tumor1 -I tumor2 ... -I tumor50
--java-options "-Djava.io.tmpdir=tmpDir -Xmx4G -XX:ParallelGCThreads=1"
```
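For context, each scatter job looks roughly like the sketch below. File names, the reference path, and the interval-list naming scheme are placeholders for my actual setup; `--native-pair-hmm-threads` is the flag that pins the AVX PairHMM to one thread.

```shell
# One scatter job (interval 17 of 500); paths are placeholders.
gatk --java-options "-Djava.io.tmpdir=tmpDir -Xmx4G -XX:ParallelGCThreads=1" \
    Mutect2 \
    -R ref.fasta \
    -I normal.bam -I tumor1.bam -I tumor2.bam \
    --native-pair-hmm-threads 1 \
    -L scatter_0017.interval_list \
    -O chunk_0017.vcf.gz
```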
This runs just fine except for the very last interval, which includes all the HLA and decoy loci. That interval can take as long as a day to complete, whereas the other intervals finish in a matter of hours.
What I've Found So Far:
I found this legacy forum post. A user named davidben wrote the following, albeit in regard to a slightly different question:
A back-up plan that might help is to introduce some very conservative downsampling: `--downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6`. This will only truncate pathological regions of extreme depth due to mapping error.
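To build intuition, here is a toy model of what I understand stride-based downsampling to do. This is not GATK's actual implementation, and I'm assuming the pooled cap works out to `max_per_start * stride` per window; the real semantics in Mutect2's downsamplers may differ.

```python
# Toy model of stride-based downsampling, for intuition only.
# NOT GATK's implementation; the pooled-cap assumption is mine.
from collections import defaultdict

def downsample(read_starts, stride=20, max_per_start=6):
    """Cap reads kept per stride-sized window of alignment starts.

    read_starts: alignment start positions, one entry per read.
    Returns the kept start positions.
    """
    buckets = defaultdict(list)
    for start in read_starts:
        buckets[start // stride].append(start)
    cap = max_per_start * stride  # assumed pooled cap per window
    kept = []
    for bucket in sorted(buckets):
        kept.extend(buckets[bucket][:cap])
    return kept

# A pathological pileup of 10,000 reads at one position is cut
# to 120, while ordinary depth (one read per position) is untouched.
print(len(downsample([100] * 10_000)))      # 120
print(len(downsample(list(range(200)))))    # 200
```

Under this model only regions of extreme, mapping-error-like depth are truncated, which matches davidben's description.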
- Does anyone have experience with the downsampling options described above?
- What constitutes a "suspicious read"?
- Are there any other options that I'm not considering?
- 2023-08-28: Corrected the description of the last interval: it's the decoy loci and the HLA loci that are causing the slowdown.
- 2023-08-28: With the downsampling options above, the intervals finish within a reasonable time frame; however, those jobs now consume more memory (>4 GB), causing several of them to fail.
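One workaround I'm trying: raise the Java heap only for the jobs that now exceed 4 GB, and raise the cluster memory request above `-Xmx`, since the native AVX PairHMM and other JNI code allocate outside the Java heap. The heap size and headroom below are guesses for my setup, not recommended values.

```shell
# Larger heap for the memory-hungry intervals only (values are guesses).
--java-options "-Djava.io.tmpdir=tmpDir -Xmx8G -XX:ParallelGCThreads=1"
# Scheduler request should exceed -Xmx to leave room for
# off-heap (native PairHMM) allocations, e.g. request ~10 GB.
```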