Parallelizing Mutect2
Is there a proper way to parallelize Mutect2 in GATK 4.1.2.0? I had read that Spark was not working properly for recent versions of Mutect2, so I did not use it.
-
Hi,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
There is no Spark version of Mutect2, nor is there any multiprocessing on a single machine. The Mutect2 WDL and the featured workflow on Terra parallelize by scattering over multiple machines, and the Google Cloud compute cost is quite low.
-
I noticed from this link (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2) that one can parallelize Mutect2 over chromosomes using the command:
for chromosome in {1..22}; do
  gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf
done
Two questions:
1. Is there a reason why this doesn't include the sex chromosomes? It should still work for those, right?
2. If I have both male and female BAMs, do I have to separate them? Or can I include the Y chromosome in parallelizing the female BAMs (and if I do, will Mutect ignore it)?
-
That link is just a pedagogical example showing how to handle the --f1r2-tar-gz output for a scattered job. The best way to scatter is through the featured workflow, which handles merging of not only the scattered VCFs but also the F1R2 tables, bamouts, pileup summaries for calculating contamination, etc.
To answer your second question, you may run male and female samples with the same intervals and scatters. The female samples will simply yield VCFs with no records in the Y chromosome, which is fine.
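For anyone who wants to try the tutorial loop on a single machine before moving to the featured workflow, here is a minimal sketch that runs one Mutect2 job per contig in the background, includes X and Y, and then merges the per-contig outputs. It is not the featured workflow. The file names (ref.fasta, tumor.bam), the contig naming (1..22, X, Y), and the GATK variable with its dry-run default are assumptions for illustration. The merge steps use MergeVcfs, MergeMutectStats and LearnReadOrientationModel as I recall them from the tutorial and the WDL, so check the arguments against your GATK version.

```shell
#!/usr/bin/env bash
# Sketch only: scatter Mutect2 over contigs on one machine, then gather.
# Assumed names: ref.fasta, tumor.bam, contigs 1..22/X/Y (use chr1.. if your
# reference is named that way).
set -euo pipefail

# Dry run by default: each command is printed instead of executed.
# Export GATK=gatk to run for real. Left unquoted below so "echo gatk" splits.
GATK="${GATK:-echo gatk}"

contigs=( {1..22} X Y )

scatter() {
  local contig pid pids=()
  for contig in "${contigs[@]}"; do
    $GATK Mutect2 -R ref.fasta -I tumor.bam -L "$contig" \
      --f1r2-tar-gz "${contig}-f1r2.tar.gz" \
      -O "${contig}-unfiltered.vcf" &
    pids+=( "$!" )
  done
  # Wait on each job individually so a failed shard aborts the script.
  for pid in "${pids[@]}"; do
    wait "$pid"
  done
}

gather() {
  local contig vcf_args=() stats_args=() f1r2_args=()
  for contig in "${contigs[@]}"; do
    vcf_args+=( -I "${contig}-unfiltered.vcf" )
    stats_args+=( -stats "${contig}-unfiltered.vcf.stats" )
    f1r2_args+=( -I "${contig}-f1r2.tar.gz" )
  done
  $GATK MergeVcfs "${vcf_args[@]}" -O merged-unfiltered.vcf
  $GATK MergeMutectStats "${stats_args[@]}" -O merged-unfiltered.vcf.stats
  $GATK LearnReadOrientationModel "${f1r2_args[@]}" -O read-orientation-model.tar.gz
}

scatter
gather
```

This launches all 24 jobs at once, so on a real run you would probably want to cap the number of concurrent jobs to fit the memory and cores available. The same contig list works for male and female samples, since a female sample simply produces an empty Y shard.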
-
It was my understanding that GATK usually uses interval lists to parallelize the workflow. I remember seeing that GATK has these available for WGS and Broad-based sequencing (https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists), but in the event that we aren't using these methods, how should we proceed?
I also see that the scattered workflow on GitHub (https://github.com/gatk-workflows/gatk4-somatic-snvs-indels/blob/master/mutect2.wdl) uses SplitIntervals. Do we need an interval list to use this feature?
For example, if I have WES sequencing done without interval lists, can I still use SplitIntervals on the relevant reference file only to parallelize calling?
-
vctrymao, providing an interval list via the -L argument does not parallelize GATK. Rather, it tells the tool which intervals to process sequentially.
An interval list is optional in the Mutect2 WDL. If you do not provide one, the pipeline splits the reference into chunks of roughly equal size, measured in base pairs rather than by contig. For example, if you scatter into 200 Mb chunks, the first scatter is 1:1-200,000,000, the next contains 1:200,000,001-260,000,000 and 2:1-140,000,000, and the last one contains a bit of X, all of Y, and all the little decoy and alt contigs.
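The base-pair accounting in that example can be sketched in a few lines of bash. This is only an illustration of the idea that a chunk is filled by size and may span contig boundaries; it is not how SplitIntervals is actually implemented. The function name split_by_bases and the toy contig lengths (scaled down by a factor of a million from the example above) are made up for the sketch.

```shell
#!/usr/bin/env bash
# Toy illustration: cut a list of contigs into chunks of a fixed number of
# bases, letting a chunk run across contig boundaries.
# Usage: split_by_bases CHUNK_SIZE name=length [name=length ...]
split_by_bases() {
  local chunk=$1; shift
  local room=$chunk line="" entry name len start end
  for entry in "$@"; do
    name=${entry%%=*}
    len=${entry##*=}
    start=1
    while (( start <= len )); do
      # Take as much of this contig as still fits in the current chunk.
      end=$(( start + room - 1 ))
      if (( end > len )); then end=$len; fi
      line+="${line:+ }${name}:${start}-${end}"
      room=$(( room - (end - start + 1) ))
      start=$(( end + 1 ))
      if (( room == 0 )); then   # chunk is full: emit it and start a new one
        echo "$line"
        line=""
        room=$chunk
      fi
    done
  done
  # Emit the final, possibly short, chunk.
  if [[ -n $line ]]; then echo "$line"; fi
}

split_by_bases 200 1=260 2=240 X=50
# prints:
# 1:1-200
# 1:201-260 2:1-140
# 2:141-240 X:1-50
```

Each output line is one scatter. The second line shows a chunk that finishes contig 1 and continues into contig 2, which is the behavior described above.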
-
To follow up on this, is an interval list required for HaplotypeCaller? If so, why is it required there and not for Mutect?
-
Interval lists are optional for both tools.
-
And SplitIntervals should work for HaplotypeCaller too then, without interval lists?
-
Yes. If given no intervals, SplitIntervals splits the entire reference. You may exclude alt contigs, decoy sequences, etc. with the --min-contig-size argument.
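As a rough sketch of what that looks like on the command line, the snippet below splits the whole reference without -L and then runs HaplotypeCaller once per resulting shard. The file names (ref.fasta, sample.bam), the shard count, the 1,000,000 bp contig cutoff, and the GATK variable with its dry-run default are assumptions for illustration. I am also assuming the shards are written with an .interval_list extension, as the WDL expects, so verify that against the output of your GATK version.

```shell
#!/usr/bin/env bash
# Sketch only: split the reference with SplitIntervals, then call per shard.
# Dry run by default: commands are printed. Export GATK=gatk to execute.
GATK="${GATK:-echo gatk}"

split_reference() {
  local shards=$1 outdir=$2
  # No -L here, so the whole reference is split; contigs shorter than the
  # cutoff (alts, decoys) are dropped. The 1000000 value is an arbitrary example.
  $GATK SplitIntervals -R ref.fasta \
    --scatter-count "$shards" \
    --min-contig-size 1000000 \
    -O "$outdir"
}

call_shards() {
  local outdir=$1 shard name
  for shard in "$outdir"/*.interval_list; do
    [[ -e $shard ]] || continue   # nothing to do if no shards exist yet
    name=$(basename "$shard" .interval_list)
    $GATK HaplotypeCaller -R ref.fasta -I sample.bam \
      -L "$shard" -O "${name}.vcf.gz"
  done
}

split_reference 10 interval-files
call_shards interval-files
```

The loop runs the shards one after another. To actually parallelize, each iteration would be submitted as its own job or machine, which is what the scatter in the WDL does. Swapping HaplotypeCaller for Mutect2 in call_shards gives the somatic equivalent.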