Panel of Normals Documentation
I had some questions about the pre-made panel of normals (PON) files in the Google Cloud Best Practices folder (https://console.cloud.google.com/storage/browser/gatk-best-practices/). It would be great if some documentation regarding these files could be provided.
Specifically, would it be possible to clarify the difference between:
"Mutect2-WGS-panel-b37.vcf" and "Mutect2-exome-panel.vcf"? Are both of these files panels of normals for use in Mutect2?
Additionally, in the "somatic-hg38" folder there are files titled "1000g_pon.hg38.vcf" and "af-only-gnomad.hg38.vcf". Are these files also panels of normals for use in Mutect2? Would it be possible to describe what information these files contain (data from exomes or genomes, etc.)? I also noticed that there does not appear to be an equivalent of "Mutect2-WGS-panel-b37.vcf" for the hg38 genome build. Is it currently possible to run these analyses using hg38?
Any advice would be greatly appreciated!
Ryan Gimple, "Mutect2-WGS-panel-b37.vcf" and "Mutect2-exome-panel.vcf" are a whole-genome and an exome panel, respectively, each generated from several hundred normals sequenced with standard Broad Genomics Platform protocols 4-5 years ago. Because most errors caught by the panel of normals are mapping artifacts, these panels are still useful despite changes in sequencing technology. "1000g_pon.hg38.vcf" is an hg38 panel of normals for both exomes and whole genomes, generated from 1000 Genomes Project samples. Finally, "af-only-gnomad.hg38.vcf" is a copy of the gnomAD VCF stripped of all unnecessary INFO fields. It is used for the --germline-resource argument.
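To make the roles of the two hg38 files concrete, here is a minimal Python sketch that assembles a `gatk Mutect2` argument list with the panel of normals and the germline resource in their respective arguments. The function name `build_mutect2_command` and all file paths are hypothetical and chosen only for illustration. The flags shown (`-R`, `-I`, `-normal`, `--germline-resource`, `--panel-of-normals`, `-O`) are GATK4 Mutect2 arguments, but check them against the documentation for your GATK version.

```python
from typing import List, Optional


def build_mutect2_command(
    reference: str,
    tumor_bam: str,
    output_vcf: str,
    panel_of_normals: str,
    germline_resource: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
) -> List[str]:
    """Assemble a `gatk Mutect2` argument list. Nothing is executed here."""
    if (normal_bam is None) != (normal_sample is None):
        raise ValueError("give both normal_bam and normal_sample, or neither")

    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if normal_bam is not None and normal_sample is not None:
        # Matched-normal mode: add the normal BAM and name its sample.
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    cmd += [
        "--germline-resource", germline_resource,  # e.g. af-only-gnomad.hg38.vcf.gz
        "--panel-of-normals", panel_of_normals,    # e.g. 1000g_pon.hg38.vcf.gz
        "-O", output_vcf,
    ]
    return cmd


if __name__ == "__main__":
    # Tumor-only example; every path below is a placeholder.
    print(" ".join(build_mutect2_command(
        reference="Homo_sapiens_assembly38.fasta",
        tumor_bam="tumor.bam",
        output_vcf="unfiltered.vcf.gz",
        panel_of_normals="1000g_pon.hg38.vcf.gz",
        germline_resource="af-only-gnomad.hg38.vcf.gz",
    )))
```

The resulting list can be passed to `subprocess.run` once the paths point at real, indexed files. Building the list separately makes it easy to inspect the command before running it.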
Thanks for the information! I am using hg38 for my analyses. To be able to use the "Mutect2-WGS-panel-b37.vcf" file for my pipeline, do I need to perform a liftover step to hg38, or does this file already exist somewhere in either of the resource bundles? Or should I only use the "1000g_pon.hg38.vcf" file?
You should use the 1000 Genomes hg38 panel. Since hg38 is superior to hg19 and has fewer alignment artifacts, lifting over an hg19 panel would mean lifting over mapping artifacts that don't exist in hg38. I'm sure there are other reasons, too.
Hi @David Benjamin!
I am trying to use Mutect2-exome-panel.vcf from the somatic-b37 directory, but I have a question:
As you said, "Mutect2-WGS-panel-b37.vcf" and "Mutect2-exome-panel.vcf" are "a whole-genome and exome panel, respectively, each generated from several hundred normals". But when I read the files, I see that the INFO field says SOMATIC for every variant in them. Why are they classified as somatic? Shouldn't they be errors?
This is a relic of how those panels were generated. Mutect2 does not do anything with INFO fields in the panel of normals, so it won't affect variant calls.
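As an illustration of why the INFO column does not matter here, the following Python sketch loads a panel of normals into a set of sites keyed only on chromosome and position, never reading the INFO column. This is not Mutect2's actual implementation. The function name `load_pon_sites`, the site key, and the example records are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def load_pon_sites(vcf_lines: Iterable[str]) -> Set[Site]:
    """Collect (CHROM, POS) from VCF records; INFO and other columns are ignored."""
    sites: Set[Site] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        sites.add((fields[0], int(fields[1])))
    return sites


if __name__ == "__main__":
    # Made-up records; the INFO column says SOMATIC, as in the b37 panels.
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t12345\t.\tA\tT\t.\t.\tSOMATIC\n",
        "1\t67890\t.\tG\tC\t.\t.\tSOMATIC\n",
    ]
    pon = load_pon_sites(example)
    print(("1", 12345) in pon)  # site is in the panel regardless of INFO
    print(("1", 11111) in pon)  # site is not in the panel
```

In this sketch, a candidate call is matched to the panel purely by where it sits in the genome, so a leftover SOMATIC flag in the panel has no way to influence the result.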
Hi David Benjamin,
I saw that the 1000g_pon.hg38.vcf.gz file is almost three years old. Is it still safe to use it with the current version of MuTect2?
Yes, it is safe to use with the current version of Mutect2.
Thank you, David!
I have 2 questions about the germline resources files. I want to run Mutect2 for my exome samples (tumor and normal, aligned to hg19 reference) and, after a search in the Google cloud files (https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-b37%2F;tab=objects?prefix=&forceOnObjectsSortingFiltering=false), I found these 2 files:
My first question is: which one should I use? My second question is: given that I am working with hg19, will these b37 files work for me, or should I do a liftover?
Thank you very much in advance!
Hi, please note that your question was posted while the GATK Team was Out of Office.
Please repost any outstanding GATK issues and we will get to them if possible. Our first priority is solving GATK issues and abnormal results, see our support policy for more details.
Following on from Ryan's question:
I currently have two cohorts for which I am running exome data: one is a UK cohort and the other is from Africa. The UK cohort has ~60 samples with a matched normal, but after creating my panel of normals, the tumour-only samples still have several hundred variants called, even after stringent downstream filtering based on read depth, number of variant reads, etc. Conversely, I only have eight matched normal samples for the African cohort, which is definitely not enough to create a robust PON.
Because of this, I wanted to use the GATK Mutect2-exome-panel.vcf to see if it would aid my analysis. My question is: does Mutect2-exome-panel.vcf contain sequencing artefacts only, or does it also contain SNPs like a normal PON would? I'm a bit worried that applying this PON to my cohorts may be inappropriate because of the differences in SNPs between populations, assuming this PON was generated using a US population.
Thank you in advance for your help.
Hi, David Benjamin
When I open the file titled "1000g_pon.hg38.vcf", I find the header entries 'filtering_status=Warning: unfiltered Mutect 2 calls. Please run FilterMutectCalls to remove false positives.' and 'tumor_sample=HG02775'. Is "1000g_pon.hg38.vcf" an hg38 panel of normals, or a VCF file of one normal sample (HG02775)?
Thanks for your consideration. I look forward to hearing from you.
Hi Alex Blain,
Yes, Mutect2-exome-panel.vcf also contains SNPs, like a typical panel of normals. As for applying this PON to a cohort from a different region, this should work fine. Here is a resource about the improvements to the genome files making them more representative of regional haplotype differences:
I hope this answered your question.
Hi ming hu,
The 1000g_pon.hg38.vcf file is an hg38 panel of normals, not just a VCF for one sample. I hope this helps.
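For anyone who wants to check what a VCF contains beyond its header, here is a minimal Python sketch that separates the `##key=value` header entries (such as `tumor_sample`) from the sample columns and the number of site records. The function name `summarize_vcf` and the example lines are hypothetical. The sketch only reports what is in the file and makes no claim about how the panel was built.

```python
from typing import Dict, Iterable, List, Tuple


def summarize_vcf(lines: Iterable[str]) -> Tuple[Dict[str, str], List[str], int]:
    """Return (simple header entries, sample column names, record count)."""
    meta: Dict[str, str] = {}
    samples: List[str] = []
    n_records = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            if not value.startswith("<"):  # keep plain key=value entries only
                meta[key] = value
        elif line.startswith("#CHROM"):
            # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
            samples = line.split("\t")[9:]
        elif line:
            n_records += 1
    return meta, samples, n_records


if __name__ == "__main__":
    # Made-up lines standing in for a real file.
    example = [
        "##fileformat=VCFv4.2\n",
        "##tumor_sample=HG02775\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tT\t.\t.\t.\n",
        "chr1\t200\t.\tG\tC\t.\t.\t.\n",
    ]
    meta, samples, n_records = summarize_vcf(example)
    print(meta.get("tumor_sample"), samples, n_records)
```

For a real compressed file, `gzip.open(path, "rt")` yields lines that can be passed straight to the function. A header entry naming one sample says nothing by itself about how many sites the file lists.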
Hi GATK team,
Like some users above, I am using the 1000g_pon.hg38.vcf as a panel of normals for tumor samples where I do not have a 'normal' sample to analyse with mutect2. It would be very useful to have some more background information for this panel:
* Is this a whole genome or whole exome panel?
* Data from how many individuals was used to build it?
* If this panel were derived from whole exome variant calling, what would be the effect of using it with whole genome tumor samples? Do you lose any variants outside the exome, or does Mutect2 just assume the panel has no mutations outside the exome?
Thank you for dealing with my questions!
The 1000g_pon.hg38.vcf file is from the 1000 Genomes Project samples, which you can read more about here: https://www.internationalgenome.org/. We personally did not collect the data, but the data is publicly available.
It is built from WGS data, not WES data, which is why the panel of normals can be used with both WES and WGS somatic analyses.
Hope this helps!