Mutect2 parallelize input
Can you please provide
a) GATK version used 184.108.40.206
b) Exact GATK commands (question about options)
c) The entire error log (n/a)
I've been running Mutect2 for a while now, currently 220.127.116.11. Just today while I was looking at how to split the runs so that I could run in parallel on multiple genomic regions in parallel, I realized that the documentation is a bit unclear on things and I wanted to clairify.
The documention reads thus:
--intervals,-L:String One or more genomic intervals over which to operate This argument may be specified 0 or more times. Default value: null.
- Can the "String" above be a file (i.e. BED or .Picard Intervals) or is this intended to JUST be textual genomic coordinates? I note that -XL has similar documentation. I think this is a "File or coordinates" but I'm making sure.
- In the tutorial: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2 it uses "1..22" as the intervals for the split. Will Mutect accept that if my genome labels the contigs "chr1", "chr2" etc? I note that the mitochondrial mode uses "chrM" there.
We're running on targeted and exome sequencing, so my hope is that I can do the following:
gatk Mutect2 --intervals $CAPTURE_INTERVALS_FILE --intervals $CONTIG_NAME --intervals-set-rule INTERSECTION [...]
I.e. our $CAPTURE_INTERVALS_FILE can be a .BED or a .picard.intervals with genome-wide coordinates, and the $CONTIG_NAME would be something like "chr1", "chr2", etc (or "1","2",etc depending on Q2 above). And the "INTERSECTION" rule would intersect as a set (so all chr1 intervals in the $CAPTURE_INTERVALS_FILE would be included as a group).
Take a look at this doc:https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
As Bhanu said, interval files various formats are valid for the -L argument. The "chr1" naming convention is fine — all that matters is that your reads, reference, and other inputs have the same contig names.
Your example using --intervals-set-rule INTERSECTION will work, but the standard practice is to use SplitIntervals to subdivide your $CAPTURE_INTERVALS_FILE into multiple intervals files (that is, multiple files, each with multiple intervals, and the union of these files' intervals is the original intervals), then run Mutect2 with these subdivided interval files.
I've made a workflow that divides the regions using SplitIntervals --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION
and I think I have it working. I'm able to reconstruct the original intervals file exactly.
It should be expected that there will be some small differences in the variant calling, merged stats, and F1R2 artifact tables, right?
in a "toy" sample I made with only reads which fall into one SplitInterval, the results are byte-for-byte identical with the exception that the MergeVcfs seems to pick one of the ##GATKCommandLine lines at random, but in the simulated case with whole-exome data, we have some small differences in the calls and read counts are sometimes off by one.
alanhoyle Those kind of small differences are expected as the consequence of bounding assembly regions differently.
Please sign in to leave a comment.