Panel of normals for CNV exome analysis using WGS datasets
"GATK v4.4.0.0
Hello,
I'm currently in the process of setting up a GATK pipeline for germline CNV calling in exome samples, following the guidelines outlined in the GATK best practices. The workflow mentions the requirement of a panel of normals (PoN) file, which necessitates the inclusion of 30 healthy samples sequenced in the same run using the same library kit. Unfortunately, I do not have the necessary samples to create this PoN file.
My question is whether it's possible to utilize publicly available exome or whole-genome sequencing (WGS) datasets to generate the PoN file. If so, could you please provide a link to these datasets (both exome and WGS) if available?"
-
Hi Krishna
Creating a panel of normals is not the only option for the germline CNV calling workflow. You can instead run all of your samples together in cohort mode. With around 30 samples, a cohort-mode run also generates a model, and post-processing each sample afterwards produces the CNV calls.
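As a rough illustration of the cohort-mode order of operations, here is a minimal Python sketch that only builds and prints the two cohort-level command lines (contig ploidy first, then the CNV caller) without running them. The file names, the output directory, and the helper name `cohort_mode_commands` are placeholders made up for this example. The sketch assumes per-sample read counts and a filtered interval list already exist from the earlier preprocessing steps. Please check each argument against the tool documentation for your GATK version before using it.

```python
import shlex


def cohort_mode_commands(count_files, intervals, ploidy_priors, out_dir="gcnv_out"):
    """Build (but do not run) the two cohort-level gCNV commands as argv lists.

    All paths are hypothetical placeholders supplied by the caller.
    """
    # One -I argument per sample read-count file.
    inputs = [arg for path in count_files for arg in ("-I", path)]

    # Step 1: estimate per-contig ploidy for every sample in the cohort.
    ploidy = [
        "gatk", "DetermineGermlineContigPloidy",
        "-L", intervals,
        "--interval-merging-rule", "OVERLAPPING_ONLY",
        *inputs,
        "--contig-ploidy-priors", ploidy_priors,
        "--output", out_dir,
        "--output-prefix", "ploidy",
    ]

    # Step 2: call CNVs in cohort mode, which also fits the model.
    caller = [
        "gatk", "GermlineCNVCaller",
        "--run-mode", "COHORT",
        "-L", intervals,
        "--interval-merging-rule", "OVERLAPPING_ONLY",
        *inputs,
        "--contig-ploidy-calls", f"{out_dir}/ploidy-calls",
        "--output", out_dir,
        "--output-prefix", "cohort",
    ]
    return [ploidy, caller]


if __name__ == "__main__":
    # Thirty hypothetical count files, one per sample.
    counts = [f"counts/sample{i:02d}.counts.hdf5" for i in range(1, 31)]
    for cmd in cohort_mode_commands(
        counts, "targets.filtered.interval_list", "ploidy_priors.tsv"
    ):
        print(shlex.join(cmd))
```

The per-sample post-processing step (PostprocessGermlineCNVCalls) is left out here; it is run once per sample index after the cohort call finishes.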
As for publicly available exome and genome sets: using genome sets is doable, but exome sets are problematic. Public exome data sets may not have been prepared with the same capture kit as yours, so you would be dealing with many false positive or false negative calls caused by differences between capture kits. For genomes this is not an issue, since the whole genome is sequenced without any capture bias.
I hope this helps.
-
Thank you for the reply Gökalp Çelik
I'm curious to know if it's feasible to employ publicly accessible whole-genome datasets to create a panel of normals (PoN) file and subsequently utilize it for calling copy number variations (CNVs) in exome samples.
-
Hi Krishna
Since the two data types are fundamentally different from each other in areas such as coverage, read distribution, and depth, that approach may not be very feasible.
If you need more samples for exome CNV calling, here are a few more suggestions from my personal experience:
1- You can perform cohort-level calls with as few as 10 samples. This is not optimal for long-term goals, but you will be able to get results with a little more filtering. Some common events may end up appearing as unique CNVs in the joint call file, but you can annotate the calls using databases such as DECIPHER to obtain the frequency of such CNVs in already published data.
2- You can try using whole-exome data from public resources such as the 1000 Genomes Project. However, since the coverage and target regions are most likely different from those used for your own samples, you may need to do some homework before using those samples alongside your cohort. The most obvious solution is to generate an intersection of the target regions covered by the public exome data sets and by your capture kit. This gives a common consensus set that excludes any region not covered by both the public data and your data. The remaining regions can then be used to call CNVs.
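To make the intersection idea in suggestion 2 concrete, here is a minimal pure-Python sketch, assuming both target sets are available as BED-style lines (chromosome, 0-based start, end). The function names and the example coordinates are made up for illustration. In practice a dedicated tool such as bedtools would usually do this job; the sketch only shows the logic.

```python
from collections import defaultdict


def merge(sorted_intervals):
    """Merge overlapping or touching intervals so the list is disjoint."""
    merged = []
    for start, end in sorted_intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def read_bed(lines):
    """Parse BED lines into {chrom: sorted, disjoint [(start, end), ...]}."""
    by_chrom = defaultdict(list)
    for line in lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    return {chrom: merge(sorted(ivs)) for chrom, ivs in by_chrom.items()}


def intersect(a, b):
    """Two-pointer intersection of two sorted, disjoint interval lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def common_targets(bed_a, bed_b):
    """Keep only regions covered by both target sets, per chromosome."""
    return {
        chrom: intersect(bed_a[chrom], bed_b[chrom])
        for chrom in sorted(bed_a.keys() & bed_b.keys())
    }


if __name__ == "__main__":
    # Hypothetical targets: your capture kit versus a public exome data set.
    my_kit = read_bed(["chr1\t100\t200", "chr1\t300\t400", "chr2\t50\t150"])
    public = read_bed(["chr1\t150\t350", "chr3\t10\t20"])
    for chrom, regions in common_targets(my_kit, public).items():
        for start, end in regions:
            print(f"{chrom}\t{start}\t{end}")
```

In this toy example only the parts of chr1 present in both sets survive (150-200 and 300-350); chr2 and chr3 drop out because each is covered by only one of the two sets.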
I hope these will help.
-
Thank you so much Gökalp Çelik