Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Parallelizing Mutect2

  • Bhanu Gandham

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

     

  • David Benjamin

    There is no Spark version of Mutect2, nor is there any multiprocessing on a single machine.  The Mutect2 WDL and the featured workflow on Terra parallelize by scattering over multiple machines, and the Google Cloud compute cost is quite low.

  • vctrymao

    I noticed from this link (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2) that one can parallelize Mutect2 over chromosomes using the command:

     

    for chromosome in {1..22}; do 
    gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf
    done

    Two questions:

    1. Is there a reason why this doesn't include the sex chromosomes? It should still work for those, right?

    2. If I have both male and female BAMs, do I have to separate them? Or can I include the Y chromosome in parallelizing the female BAMs (and if I do, will Mutect ignore it)?

  • David Benjamin

    That link is just a pedagogical example showing how to handle the --f1r2-tar-gz output for a scattered job.  The best way to scatter is through the featured workflow, which handles merging of not only the scattered VCFs but also F1R2 tables, bamouts, pileup summaries for calculating contamination, etc.

    To answer your second question, you may run male and female samples with the same intervals and scatters.  The female samples will simply yield VCFs with no records in the Y chromosome, which is fine.
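
As a rough sketch of what a hand-rolled scatter over all 24 primary contigs (including X and Y) and the subsequent gather might look like, the script below prints one Mutect2 command per contig and then the merge commands. The file names ref.fasta and tumor.bam come from the loop quoted earlier in the thread; the run helper, the DRY_RUN switch, and the merged output names are assumptions made for illustration. It covers only the VCFs, the Mutect2 stats files, and the F1R2 archives, not bamouts or pileup summaries, and it is not a substitute for the featured workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Contig names must match the reference dictionary
# (for example "chr1" rather than "1" on some references).
contigs=({1..22} X Y)

# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Scatter: one independent Mutect2 job per contig.
# The same contig list can be used for male and female samples.
scatter() {
  local c
  for c in "${contigs[@]}"; do
    run gatk Mutect2 -R ref.fasta -I tumor.bam -L "$c" \
      --f1r2-tar-gz "${c}-f1r2.tar.gz" -O "${c}-unfiltered.vcf"
  done
}

# Gather: merge the per-contig VCFs, stats files, and F1R2 archives.
gather() {
  local c vcfs=() stats=() f1r2=()
  for c in "${contigs[@]}"; do
    vcfs+=(-I "${c}-unfiltered.vcf")
    stats+=(-stats "${c}-unfiltered.vcf.stats")
    f1r2+=(-I "${c}-f1r2.tar.gz")
  done
  run gatk MergeVcfs "${vcfs[@]}" -O unfiltered.vcf
  run gatk MergeMutectStats "${stats[@]}" -O unfiltered.vcf.stats
  run gatk LearnReadOrientationModel "${f1r2[@]}" -O read-orientation-model.tar.gz
}

scatter
gather
```

Each printed Mutect2 command is independent of the others, so the commands can be dispatched to separate machines or scheduler jobs; the gather step has to wait until all of them have finished.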

  • vctrymao

    It was my understanding that GATK usually uses interval lists to parallelize the workflow. I remember seeing that GATK has these available for WGS and Broad-based sequencing (https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists), but in the event that we aren't using these methods, how should we proceed?

    I also see that the scattered workflow on GitHub (https://github.com/gatk-workflows/gatk4-somatic-snvs-indels/blob/master/mutect2.wdl) uses SplitIntervals. Do we need an interval list to use this feature?

    For example, if I have WES sequencing done without interval lists, can I still use SplitIntervals on the relevant reference file only to parallelize calling? 

  • David Benjamin

    vctrymao, providing an interval list via the -L argument does not parallelize GATK.  Rather, it tells the tool which intervals to process sequentially.

    An interval list is optional in the Mutect2 WDL.  If you do not provide one, the pipeline splits the reference into chunks of roughly equal size, measured in base pairs rather than by contig. For example, using round numbers, if you scatter into 200 Mb chunks, the first scatter is 1:1-200,000,000, the next contains 1:200,000,001-260,000,000 and 2:1-140,000,000, and the last one contains a bit of X, all of Y, and all the little decoy and alt contigs.
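
To make the "base pairs, not contigs" idea concrete, here is a small sketch of fixed-size chunking that lets a chunk spill across contig boundaries. It is an illustration of the arithmetic only and is not GATK's actual splitting algorithm; the function name chunk_genome and the contig lengths are hypothetical round numbers chosen to match the example above.

```shell
#!/usr/bin/env bash
# Illustration only: cut a genome into chunks of a fixed number of base pairs,
# letting a chunk continue into the next contig when the current one runs out.
# Input:  lines of "contig<TAB>length" (the first two columns of a .fai index).
# Output: one line per interval, "chunk_index<TAB>contig:start-end".
chunk_genome() {
  local chunk_size=$1
  awk -v size="$chunk_size" 'BEGIN { chunk = 0; room = size }
  {
    contig = $1; len = $2; start = 1
    while (start <= len) {
      # Take as much of this contig as still fits in the current chunk.
      end = start + room - 1
      if (end > len) end = len
      printf "%d\t%s:%d-%d\n", chunk, contig, start, end
      room -= end - start + 1
      start = end + 1
      # Chunk is full: open the next one.
      if (room == 0) { chunk++; room = size }
    }
  }'
}

# Hypothetical contig lengths (round numbers, not real chromosome sizes).
printf '1\t260000000\n2\t240000000\n' | chunk_genome 200000000
```

With these inputs, chunk 1 holds the tail of contig 1 and the first 140 Mb of contig 2, which is the spill-over behaviour described in the comment.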

  • vctrymao

    To follow up on this, is an interval list required for HaplotypeCaller? If so, why is it required there and not for Mutect?

  • David Benjamin

    Interval lists are optional for both tools.

  • vctrymao

    And SplitIntervals should work for HaplotypeCaller too then, without interval lists?

  • David Benjamin

    Yes, if given no intervals, SplitIntervals splits the entire reference.  You may exclude alt contigs, decoy sequences, etc. with the --min-contig-size argument.
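
A minimal sketch of that approach, assuming a reference named ref.fasta and an input named sample.bam: SplitIntervals is run without -L, and HaplotypeCaller is then run once per resulting interval file. The run helper, the DRY_RUN switch, the shard count of 10, the 1,000,000 bp cutoff, and the output names are all assumptions for illustration; the script globs for *.interval_list rather than relying on specific shard file names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Split the whole reference (no -L given) into scatter_count shards,
# dropping contigs shorter than an assumed 1,000,000 bp cutoff.
split_reference() {
  local scatter_count=$1 out_dir=$2
  run gatk SplitIntervals -R ref.fasta --scatter-count "$scatter_count" \
    --min-contig-size 1000000 -O "$out_dir"
}

# Call variants on a single shard; the output is named after the shard file.
call_shard() {
  local shard=$1 name
  name=$(basename "$shard" .interval_list)
  run gatk HaplotypeCaller -R ref.fasta -I sample.bam -L "$shard" -O "${name}.vcf.gz"
}

split_reference 10 shards
for shard in shards/*.interval_list; do
  [ -e "$shard" ] || continue   # in dry-run mode no shard files exist yet
  call_shard "$shard"
done
```

The per-shard calls are independent, so in practice each one would be submitted as its own job and the resulting VCFs merged afterwards.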
