A Panel of Normals (PON) is a type of resource used in somatic variant analysis. Depending on the type of variant you're looking for, the PON will be generated differently. What all PONs have in common is that (1) they are made from normal samples (in this context, "normal" means derived from healthy tissue that is believed not to have any somatic alterations) and (2) their main purpose is to capture recurrent technical artifacts in order to improve the results of the variant calling analysis.
As a result, the most important selection criteria for choosing normals to include in any PON are the technical properties of how the data was generated. It's very important to use normals that are as technically similar as possible to the tumor (same exome or genome preparation methods, sequencing technology and so on). Additionally, the samples should come from subjects that were young and healthy to minimize the chance of using as normal a sample from someone who has an undiagnosed tumor. Normals are typically derived from blood samples.
There is no definitive rule for how many samples should be used to make a PON (even a small PON is better than no PON) but in practice we recommend aiming for a minimum of 40.
At the Broad Institute, we typically make a standard PON for a given version of the pipeline (corresponding to the combination of all protocols used in production to generate the sequence data, starting from sample preparation and including the analysis software) and use it to process all tumor samples that go through that version of the pipeline. Because we process many samples in the same way, we are able to make PONs composed of hundreds of samples.
Variant type-specific recommendations are given below.
Short variants (SNVs and indels)
For short variant discovery, the PON is created by running the variant caller Mutect2 individually on a set of normal samples and combining the resulting variant calls with some criteria (e.g. excluding any sites that are not present in at least 2 normals) as defined in the Best Practices documentation. This produces a sites-only VCF file that can be used as PON for Mutect2.
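The "present in at least 2 normals" criterion can be illustrated with a minimal Python sketch. This is not how GATK implements PON creation (in GATK the per-normal Mutect2 calls are combined by dedicated tools such as CreateSomaticPanelOfNormals); the function name build_pon_sites, the tuple representation of a site, and the toy data are all assumptions made purely for illustration.

```python
from collections import Counter

def build_pon_sites(per_normal_calls, min_normals=2):
    """Return the sorted sites called in at least `min_normals` normals.

    `per_normal_calls` is a list with one entry per normal sample; each
    entry is an iterable of (chrom, pos, ref, alt) tuples. This layout is
    a hypothetical stand-in for per-sample Mutect2 VCFs.
    """
    counts = Counter()
    for calls in per_normal_calls:
        # Deduplicate so a site is counted at most once per normal.
        counts.update(set(calls))
    return sorted(site for site, n in counts.items() if n >= min_normals)

# Toy data: three normals with artifact-like calls.
normals = [
    [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")],
    [("chr1", 1000, "A", "T")],
    [("chr3", 42, "C", "CT"), ("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")],
]
pon = build_pon_sites(normals)
```

Here the chr3 site is dropped because it appears in only one normal, while the two recurrent sites are kept. The result carries no genotype information, which mirrors the sites-only nature of the PON VCF.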
Copy Number Variants
For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool. This produces a binary file that can be used as PON.
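To give a rough intuition for why normal coverage profiles help, here is a deliberately simplified Python sketch. It is an assumption-laden stand-in, not the GATK method: the dedicated PON creation tool is considerably more sophisticated than a per-interval median, and the function names and toy read counts below are hypothetical.

```python
from statistics import median

def fractional_coverage(counts):
    """Scale per-interval read counts so they sum to 1 for one sample."""
    total = sum(counts)
    return [c / total for c in counts]

def build_median_panel(normal_counts):
    """Per-interval median of fractional coverage across all normals."""
    fractions = [fractional_coverage(c) for c in normal_counts]
    # zip(*fractions) groups the values for each interval across samples.
    return [median(interval) for interval in zip(*fractions)]

def copy_ratios(tumor_counts, panel):
    """Tumor fractional coverage divided by the panel's expected value.

    Assumes every panel entry is non-zero; intervals with zero median
    coverage would need to be filtered out beforehand.
    """
    tumor = fractional_coverage(tumor_counts)
    return [t / p for t, p in zip(tumor, panel)]

# Toy data: three normals, three intervals, same shape at different depths.
normals = [[100, 200, 100], [50, 100, 50], [200, 400, 200]]
panel = build_median_panel(normals)
ratios = copy_ratios([100, 200, 100], panel)
```

In this toy case the second interval always receives twice the coverage of the others, a systematic bias that the panel captures. A tumor with the same coverage shape therefore gets a copy ratio of 1.0 everywhere instead of a spurious gain in the second interval.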
Public GATK Panel of Normals
Public GATK panels of normals are available to download as part of the GATK resource bundle.
For hg38, access them using
gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz
For hg19/b37, access them using
gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf
or
gs://gatk-best-practices/somatic-b37/Mutect2-WGS-panel-b37.vcf
More information about publicly available resources is available here.
5 comments
Hi,
The listed PONs are in VCF format, but the GATK copy number pipelines require an HDF5 file. Is there another processing step we need to do before we can use these files?
Thanks.
Dear GATK Team,
I am using Mutect2 for somatic mutation identification from MMRF data, which contains multiple myeloma (MM) samples from 5 different ethnicities. I have only the tumor and the corresponding matched normal for around 1004 MM WES samples. As per the documentation of PON,
1. PON should be created from healthy normals without an undiagnosed tumor, which I don't have.
2. Secondly, MM is a very heterogeneous disease that has a unique mutational signature and clonal evolution history for each ethnicity.
So is it preferable to use PON for somatic mutation identification or rely on tumors and matched normals only? Kindly suggest.
Hi GATK team
Does the PoN also help to remove additional germline variants which might be missed in a sample due to low coverage in the matched normal or the absence of a matched normal?
Or do you think the germline resource provided in the form of a 1000 Genomes or dbSNP VCF is sufficient for germline variant removal?
I ask this because I am trying to figure out if we should create a PoN with normal samples coming from different ethnicities to remove germline variants effectively.
Hi GATK team,
Thank you so much for the explanation. I am working with a tool to call mutation from RNA-seq data and would like to use the PoN to filter out any sequencing artifacts.
Will it be possible to add the PoN that was used for the RNA mutect development (https://github.com/broadinstitute/RNA_MUTECT_1.0-1) to the gatk-best-practices bucket?
Many thanks,
Eila
Dear GATK team,
Thank you for this thread. I want to ask:
I am applying GATK Mutect2 to targeted sequencing data covering a set of genes and some genomic regions of interest, without having a PON for the samples. The reference genome used is hg19. Would the best option for the PON in this case be the exome or the WGS version? What would be the best practice?
Best Regards, Manuel