Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

gnomAD files for Mutect2

Answered

12 comments

  • Tiffany Miller

    Hi Felix! I believe you can use the files as is if you are following the best-practices tutorial, since later steps in the pipeline should handle reducing to PASS variants, but I will double-check this. Here are some notes on this file (posted previously by David Benjamin):

    The gnomAD VCF (if you take it from the gnomAD site) is enormous because it contains a lot of INFO field annotations, none of which Mutect2 needs except for AF (allele frequency in the population). The AF-only gnomAD file that we provide in the best-practices Google bucket is the gnomAD VCF with all extraneous annotations removed. In principle you could use gnomAD with all the annotations, but it would waste a lot of CPU time parsing the VCF.
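
To illustrate what "AF-only" means at the level of a single VCF record, here is a minimal Python sketch. The function name `keep_af_only` is hypothetical and this is not how the file in the bucket was produced; it only shows the idea of dropping every INFO annotation except AF. It assumes tab-separated VCF lines with INFO in the eighth column.

```python
def keep_af_only(vcf_line: str) -> str:
    """Return a VCF line with every INFO annotation except AF removed.

    Hypothetical helper for illustration only. Header lines are passed
    through unchanged, so stale ##INFO definitions remain in the header.
    """
    if vcf_line.startswith("#"):
        return vcf_line

    has_newline = vcf_line.endswith("\n")
    fields = vcf_line.rstrip("\n").split("\t")

    # INFO is the 8th column (index 7): semicolon-separated KEY=VALUE pairs.
    # Matching on "AF=" avoids picking up keys such as AF_afr.
    af_entries = [kv for kv in fields[7].split(";") if kv.startswith("AF=")]
    fields[7] = af_entries[0] if af_entries else "."

    return "\t".join(fields) + ("\n" if has_newline else "")


record = "chr1\t100\t.\tA\tG\t50\tPASS\tAC=3;AF=0.01;AN=300\n"
print(keep_af_only(record), end="")
```

For real gnomAD-sized files a dedicated VCF tool is likely the better choice; for example, `bcftools annotate -x ^INFO/AF` removes all INFO tags except AF.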

    Also, you may appreciate this workspace, which shows how to run Mutect2 on a cloud platform called Terra (https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4). It has instructions for running the workflows, which are preconfigured with all the resources they need (though if you are using hg38 you would need to add that reference table). If you are interested in using the platform and have any questions, let me know. We support Terra as well.

     

  • Felix

    Hi Tiffany,

    Thank you for your reply! I have experimented with the original gnomAD VCF and also noticed the file size problem. It is a great idea to improve the speed by using the AF-only files.

    My only remaining question is whether the AF-only files must contain only PASS variants. I have skimmed the source code of mutect2.wdl (https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect2.wdl), but I haven't found any step that filters the AF-only VCFs for PASS variants.

    I may use Terra in the future, but for now I am trying to build my own Mutect2 pipeline on our local servers. I am just taking the mutect2.wdl as an example.

    Have a nice day,

    Felix

  • Tiffany Miller

    Hi Felix! Sorry, I don't have clearer news yet. I've messaged a few folks and will hopefully have a good answer soon.

  • Felix

    Thank you for the update, Tiffany.

  • Tiffany Miller

    Alright Felix, I found out that the gnomAD file should contain only passing variants. Please restrict the files to PASS variants for your use. I am going to get that prioritized on our side. Thank you for asking this question!
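
Restricting a VCF to passing variants can be sketched in a few lines of Python. This is an illustrative sketch, not the GATK team's procedure: the function name `keep_pass_records` is hypothetical, and treating only the literal value `PASS` in the FILTER column as passing (so that unfiltered `.` records are dropped) is an assumption you may want to revisit for your files.

```python
from typing import Iterable, Iterator


def keep_pass_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines plus records whose FILTER column is exactly PASS.

    Hypothetical helper for illustration. Assumption: records with
    FILTER "." (no filtering applied) are treated as not passing.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th column (index 6) of a VCF record.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.01\n",
    "chr1\t200\t.\tC\tT\t.\tAC0\tAF=0.0\n",
]
for kept in keep_pass_records(example):
    print(kept, end="")
```

The generator works on any iterable of lines, so it can be fed from `gzip.open(path, "rt")` for compressed files. On large files an established tool is probably preferable; `bcftools view -f PASS` applies the same kind of restriction.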

  • Felix

    Thank you very much for your help, Tiffany! I will prepare my files accordingly.

  • Vincent Appiah

    Hi all, I have 14 whole-exome samples (each from a different patient). At the moment I don't have any normal sample files. Based on the discussions in the GATK best practices and other forums, I settled on the approach below, and I would be grateful if you could advise whether it is suitable:

    1. I call the individual samples separately with Mutect2 using 1000g_pon.hg38.vcf.gz  and af-only-gnomad.hg38.vcf.gz files (available at https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects?prefix)

    2. Perform downstream analysis (e.g., annotation).

     

  • Genevieve Brandt (she/her)

    Hi Vincent Appiah, the GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

  • David Benjamin

    Vincent Appiah That is correct.
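
As a rough sketch of the tumor-only approach confirmed here, the Python snippet below builds one Mutect2 command line per sample. The function name, sample names, and file paths are hypothetical; the resource file names are the ones mentioned in the question, and the snippet only constructs the argument lists rather than running GATK.

```python
from typing import List


def mutect2_tumor_only_cmd(
    bam: str,
    reference: str,
    out_vcf: str,
    germline_resource: str = "af-only-gnomad.hg38.vcf.gz",
    panel_of_normals: str = "1000g_pon.hg38.vcf.gz",
) -> List[str]:
    """Build a tumor-only Mutect2 command (no matched normal) as an argv list."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", bam,
        "--germline-resource", germline_resource,
        "--panel-of-normals", panel_of_normals,
        "-O", out_vcf,
    ]


# Hypothetical sample names; each sample is called separately.
for sample in ["patient01", "patient02"]:
    cmd = mutect2_tumor_only_cmd(
        bam=f"{sample}.bam",
        reference="Homo_sapiens_assembly38.fasta",  # assumed reference path
        out_vcf=f"{sample}.unfiltered.vcf.gz",
    )
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. For exome data you would typically also restrict calling to the capture targets with `-L`, which is omitted here.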

  • Anitha R

    Hello, as I am new to the GATK pipeline, I have some questions regarding the supplementary files used for the somatic variant filtering step.

    1. I call individual samples separately using Mutect2 with the files 1000g_pon.hg38.vcf.gz and af-only-gnomad.hg38.vcf.gz, which can be found in the GATK storage bucket (available at https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects?prefix).

    2. For downstream analysis (filtering the variants), there are two approaches: filtering directly with one command using FilterMutectCalls, or following three steps: GetPileupSummaries, CalculateContamination, and FilterMutectCalls. Which method should I choose? (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2)

    3. If I opt for the three-step process, I understand that the GetPileupSummaries step requires both -V and -L files. The -V file should be a biallelic VCF, and the -L file can be a .bed or .interval_list file. Many forums suggest using af-only-gnomad.hg38.vcf.gz and somatic-hg38_small_exac_common_3.hg38.vcf.gz for both -V and -L.

    I am confused about which files to use and whether a BED file is necessary. If it is, how can I create a BED file?

  • Gökalp Çelik

    Hi Anitha R

    We discussed this very same issue here. 

    https://gatk.broadinstitute.org/hc/en-us/community/posts/5447277150107-GetPileupSummaries-common-germline-variant-sites-VCF-hg38 

    You don't need a separate BED file for collecting pileup summaries. The same VCF file that designates the common sites can also serve as the interval list, so pass the same file to both -V and -L.

    Collecting pileup summaries and calculating contamination also activates additional filters within the FilterMutectCalls tool. If you omit those inputs, the contamination filter and a basic copy number analysis, both of which contribute to proper somatic variant filtration, won't be applied.
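
The three-step route described above can be sketched as a small Python helper that assembles the GATK command lines. The function name and the output file naming are hypothetical, and the snippet only builds argument lists; it does not run the tools. The one point it is meant to show is that the same common-sites VCF goes to both `-V` and `-L` in GetPileupSummaries.

```python
from typing import List


def contamination_filter_cmds(
    tumor_bam: str,
    reference: str,
    unfiltered_vcf: str,
    common_sites_vcf: str,
    prefix: str,
) -> List[List[str]]:
    """Build GetPileupSummaries, CalculateContamination and FilterMutectCalls commands."""
    # Hypothetical intermediate and output file names derived from the prefix.
    pileups = f"{prefix}.pileups.table"
    contamination = f"{prefix}.contamination.table"
    segments = f"{prefix}.segments.table"
    filtered = f"{prefix}.filtered.vcf.gz"

    return [
        # The common-sites VCF is used both as the variant file and as the intervals.
        ["gatk", "GetPileupSummaries", "-I", tumor_bam,
         "-V", common_sites_vcf, "-L", common_sites_vcf, "-O", pileups],
        ["gatk", "CalculateContamination", "-I", pileups,
         "--tumor-segmentation", segments, "-O", contamination],
        ["gatk", "FilterMutectCalls", "-R", reference, "-V", unfiltered_vcf,
         "--contamination-table", contamination,
         "--tumor-segmentation", segments, "-O", filtered],
    ]


for cmd in contamination_filter_cmds(
    tumor_bam="patient01.bam",                      # assumed input paths
    reference="Homo_sapiens_assembly38.fasta",
    unfiltered_vcf="patient01.unfiltered.vcf.gz",
    common_sites_vcf="small_exac_common_3.hg38.vcf.gz",
    prefix="patient01",
):
    print(" ".join(cmd))
```

The commands must run in this order, since each step consumes the previous step's output table.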

    I hope this clarifies the issue. 

    Regards. 

  • Anitha R

    Thank you, Gökalp Çelik for the clarification.
