gnomAD files for Mutect2
Hello all,
I am new to GATK and I am trying to perform somatic variant calling following this tutorial: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132
I am confused about which af-only gnomAD files should be used for the tutorial. Apparently, the gnomAD file must contain only PASS variants: https://gatkforums.broadinstitute.org/gatk/discussion/comment/58618/
It sounds like this applies both to the gnomAD file for Mutect2 (--germline-resource) and to the biallelic version for GetPileupSummaries. However, contrary to what that thread says, the af-only-gnomad.hg38.vcf.gz and small_exac_common_3.hg38.vcf.gz files from https://console.cloud.google.com/storage/browser/_details/gatk-best-practices/somatic-hg38/ also contain variants that did not pass all filters.
--> somatic-hg38_small_exac_common_3.hg38.vcf.gz (md5: 4ac7593efd401234654fdf87ab1b5ef1)
# variants   Filter
       301   AC_Adj0_Filter
      2588   InbreedingCoeff_Filter
        13   VQSRTrancheINDEL99.50to99.90
         4   VQSRTrancheINDEL99.90to99.95
        14   VQSRTrancheINDEL99.95to100.00
       673   VQSRTrancheSNP99.60to99.80
       610   VQSRTrancheSNP99.80to99.90
       324   VQSRTrancheSNP99.90to99.95
       222   VQSRTrancheSNP99.95to100.00
--> somatic-hg38_af-only-gnomad.hg38.vcf.gz (md5: 30d500cc0f4adc640ddbb25eb341c89d)
# variants   Filter
    227532   InbreedingCoeff
Can these files be used for the MuTect2 workflow or do I have to restrict them to PASS variants?
Best,
Felix
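For anyone who wants to produce a tally like the one above for their own copy of these files, here is a minimal Python sketch. It is not taken from the thread. It assumes a standard tab-delimited VCF in which the 7th column is FILTER, and it counts every name in a semicolon-separated FILTER value separately, which may or may not match how the counts above were generated. The function names `open_vcf` and `count_filters` are made up for illustration.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_filters(path):
    """Return a Counter mapping each FILTER name to the number of records carrying it."""
    counts = Counter()
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            # Column 7 (index 6) is FILTER; several names are joined with ';'.
            for name in fields[6].split(";"):
                counts[name] += 1
    return counts
```

Calling `count_filters("af-only-gnomad.hg38.vcf.gz")` and printing the entries other than PASS should give a per-filter summary in the same spirit as the listing above.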
-
Hi Felix! I believe you can use the files as is if you are following the best-practices tutorial, since later steps in the pipeline will handle reducing to PASS variants, but I will double-check this. Here are some notes on this file (posted previously by David Benjamin):
The gnomAD VCF [if you take it from the gnomad site] is enormous because it contains a lot of INFO field annotations, none of which Mutect2 needs except for AF (allele frequency in the population). The AF only gnomad that we provide in the best practices google bucket is the gnomAD VCF with all extraneous annotations removed. In principle you could use gnomAD with all the annotations, but it would waste a lot of CPU time parsing the VCF.
Also, you may appreciate this workspace that shows how to run Mutect2 in a cloud platform called Terra: https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4 It has instructions on how to run the workflows and has the workflows configured to all the resources needed for it (though if you are using hg38 you'd need to add that reference table). If you are interested in using the platform and have any questions, let me know. We support Terra as well.
-
Hi Tiffany,
Thank you for your reply! I have experimented with the original gnomAD VCF and also noticed the file size problem. It is a great idea to improve the speed by using the AF-only files.
My only question is whether the AF-only files must contain only PASS variants. I have skimmed the source of mutect2.wdl (https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect2.wdl), but I haven't found any step that filters the AF-only VCFs down to PASS variants.
I may use Terra in the future, but for now I am trying to build my own Mutect2 pipeline on our local servers. I am just taking the mutect2.wdl as an example.
Have a nice day,
Felix
-
Hi Felix! Sorry, I don't have clearer news yet. I've messaged a few folks and will hopefully have a good answer soon.
-
Thank you for the update, Tiffany.
-
Alright Felix, I found out that the gnomAD file should contain only passing variants. Please restrict your files to PASS variants before using them. I am going to get that prioritized on our side. Thank you for asking this question!
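One way to do that restriction is sketched below in Python. This is an illustration under stated assumptions, not the method used by the GATK team: it assumes a tab-delimited VCF with FILTER in the 7th column, and by default it also keeps records whose FILTER is "." (no filtering applied), which is a choice you may want to turn off. The names `is_passing` and `write_pass_only` are hypothetical.

```python
import gzip


def is_passing(filter_field, accept_missing=True):
    """True for FILTER == 'PASS', and optionally for '.' (filters not applied)."""
    if filter_field == "PASS":
        return True
    return accept_missing and filter_field == "."


def write_pass_only(src, dst, accept_missing=True):
    """Copy header lines and passing records from src to dst (uncompressed).

    Returns the number of variant records written.
    """
    opener = gzip.open if str(src).endswith(".gz") else open
    kept = 0
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)  # keep all meta-information and header lines
                continue
            # Split only as far as needed; index 6 is the FILTER column.
            filter_field = line.split("\t", 7)[6]
            if is_passing(filter_field, accept_missing):
                fout.write(line)
                kept += 1
    return kept
```

The output here is an uncompressed VCF, so it would still need to be bgzip-compressed and indexed before use as a resource. GATK's SelectVariants also offers an --exclude-filtered option that may be a simpler route if you prefer to stay within GATK.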
-
Thank you very much for your help, Tiffany! I will prepare my files accordingly.
-
Hi all, I have 14 whole-exome samples (each from a different patient). At the moment I don't have any normal samples. Based on the discussions in the GATK best practices and other forums, I have settled on the approach below. I would be grateful if you could advise whether it is suitable:
1. I call the individual samples separately with Mutect2 using 1000g_pon.hg38.vcf.gz and af-only-gnomad.hg38.vcf.gz files (available at https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects?prefix)
2. Perform downstream analysis (e.g., annotation, etc.)
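Step 1 above, sketched as a small Python helper that builds a tumor-only Mutect2 command line. The Mutect2 options used (-R, -I, --germline-resource, --panel-of-normals, -O) are standard ones; the helper name and the reference, BAM, and output file names in the example are placeholders I made up, not values from this thread.

```python
import shlex


def mutect2_tumor_only_command(reference, tumor_bam, germline_resource,
                               panel_of_normals, output_vcf):
    """Build the argument list for a tumor-only Mutect2 call on one sample."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "--germline-resource", germline_resource,
        "--panel-of-normals", panel_of_normals,
        "-O", output_vcf,
    ]


if __name__ == "__main__":
    # Placeholder sample and reference names; substitute your own paths.
    cmd = mutect2_tumor_only_command(
        reference="reference.fasta",
        tumor_bam="sample01.bam",
        germline_resource="af-only-gnomad.hg38.vcf.gz",
        panel_of_normals="1000g_pon.hg38.vcf.gz",
        output_vcf="sample01.unfiltered.vcf.gz",
    )
    # Print a shell-safe command; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list keeps it easy to loop over the 14 samples and pass each one to `subprocess.run` without shell quoting problems.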
-
Hi Vincent Appiah, the GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Vincent Appiah, that is correct.
-
Hello, As I am new to the GATK pipeline, I have some questions regarding the supplement file used for the somatic variant filtering step.
I call individual samples separately using Mutect2 with the files 1000g_pon.hg38.vcf.gz and af-only-gnomad.hg38.vcf.gz, which can be found at the GATK storage (available at https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects?prefix).
For downstream analysis (filtering the variants), there are two approaches: filtering directly with one command using FilterMutectCalls, or following three steps: GetPileupSummaries, CalculateContamination, and FilterMutectCalls. Which method should I choose? (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2)
If I opt for the three-step process, I understand that the GetPileupSummaries step requires both -V and -L files. The -V file should be a biallelic VCF, and the -L file can be a .bed or .interval_list file. Many forums suggest using af-only-gnomad.hg38.vcf.gz and somatic-hg38_small_exac_common_3.hg38.vcf.gz for both -V and -L. I am confused about which files to use and whether a BED file is necessary. If it is, how can I create a BED file?
-
Hi Anitha R
We discussed this very same issue here.
You don't need to use a separate bed file for collecting pileup summaries. You can use the same vcf file that you use to designate common sites to collect pileup summaries. So use the same file for both -V and -L.
Collecting pileup summaries and calculating contamination enables additional filters within the FilterMutectCalls tool. If you omit those inputs, the contamination filter and a basic copy-number analysis will not be applied during somatic variant filtration.
I hope this clarifies the issue.
Regards.
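To make the three-step route described above concrete, here is a hedged Python sketch that builds the three command lines, using the same common-sites VCF for both -V and -L. The GATK options shown (-I, -V, -L, -O, -R, --tumor-segmentation, --contamination-table) exist in the GATK4 tools named; the helper name, the `prefix` parameter, and all output file names are assumptions made for illustration.

```python
import shlex


def contamination_filter_commands(reference, tumor_bam, common_sites_vcf,
                                  unfiltered_vcf, prefix="tumor"):
    """Build the GetPileupSummaries, CalculateContamination and
    FilterMutectCalls command lines for one tumor sample."""
    pileups = f"{prefix}.pileups.table"
    contamination = f"{prefix}.contamination.table"
    segments = f"{prefix}.segments.table"
    filtered = f"{prefix}.filtered.vcf.gz"
    return [
        # The same VCF serves as the variant resource (-V) and the intervals (-L).
        ["gatk", "GetPileupSummaries", "-I", tumor_bam,
         "-V", common_sites_vcf, "-L", common_sites_vcf, "-O", pileups],
        # Estimate contamination and write the tumor segmentation table.
        ["gatk", "CalculateContamination", "-I", pileups,
         "--tumor-segmentation", segments, "-O", contamination],
        # Passing both tables lets FilterMutectCalls use them when filtering.
        ["gatk", "FilterMutectCalls", "-R", reference, "-V", unfiltered_vcf,
         "--contamination-table", contamination,
         "--tumor-segmentation", segments, "-O", filtered],
    ]


if __name__ == "__main__":
    # Placeholder inputs; substitute your own reference, BAM and Mutect2 output.
    for cmd in contamination_filter_commands(
        "reference.fasta", "sample01.bam",
        "small_exac_common_3.hg38.vcf.gz", "sample01.unfiltered.vcf.gz",
        prefix="sample01",
    ):
        print(shlex.join(cmd))
```

If the two tables are left out of the FilterMutectCalls call, the command still runs as the one-step route, but without the contamination-based filtering described above.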
-
Thank you, Gökalp Çelik for the clarification.