Suitability of GermlineCNVCaller for targeted sequencing dataset
Hi GATK Team,
(running GATK 4.2.0.0 from docker container)
I was hoping you may be able to provide some advice. I have been asked to perform a CNV analysis of a targeted sequencing data set of case/control design. My concern is that the targeted regions are highly variable in length ranging from 168bp to 22kbp; regions sum to a total length of 152kbp.
I have broadly followed the steps described in this article:
https://gatk.broadinstitute.org/hc/en-us/articles/360035531152
calling variation in the case samples against models generated from the controls.
My first question is whether this tool is appropriate for this dataset, particularly given the small amount of the genome covered and the large variability in region size. I note that the tool documentation states 'For WES and WGS, we recommend no less than 10000 consecutive intervals spanning at least 10 - 50 mb.'
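To make the comparison with the documented recommendation concrete, here is a minimal Python sketch that summarises a set of target intervals. It assumes BED-style input (0-based start, end-exclusive), reads "spanning" as the summed interval length, and takes the lower bound of the quoted 10 - 50 mb range as the threshold. The function names and the three example intervals are hypothetical and are not taken from the actual panel.

```python
from typing import Iterable, NamedTuple


class IntervalSummary(NamedTuple):
    count: int      # number of intervals
    shortest: int   # length of the shortest interval, in bp
    longest: int    # length of the longest interval, in bp
    total: int      # summed interval length, in bp


def summarise_bed(lines: Iterable[str]) -> IntervalSummary:
    """Summarise interval lengths from BED-style lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start)
    if not lengths:
        raise ValueError("no intervals found")
    return IntervalSummary(len(lengths), min(lengths), max(lengths), sum(lengths))


# Thresholds taken from the documentation sentence quoted above
# (assumption: lower bound of the "10 - 50 mb" range).
MIN_INTERVALS = 10_000
MIN_SPAN_BP = 10_000_000


def meets_documented_recommendation(summary: IntervalSummary) -> bool:
    """True only if both the interval count and the summed length reach the thresholds."""
    return summary.count >= MIN_INTERVALS and summary.total >= MIN_SPAN_BP


# Hypothetical example intervals, for illustration only.
example = [
    "chr1\t1000\t1168\ttargetA",
    "chr1\t50000\t72000\ttargetB",
    "chr2\t200\t900\ttargetC",
]
summary = summarise_bed(example)
print(summary)
print(meets_documented_recommendation(summary))
```

A panel summing to 152kbp falls well short of the 10 mb lower bound on summed length alone, which is the concern raised in the question.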
Secondly, if it is reasonable, how should the tool parameters be configured with respect to class and cnv coherence length? I have used the parameters below, setting each to 150bp (i.e. within the size of the smallest interval). However, this is a dramatic departure from the default 10,000, and so I'd like to make sure I haven't completely misunderstood!
--class-coherence-length 150 \
--cnv-coherence-length 150 \
--interval-psi-scale 1.0E-6 \
--log-mean-bias-standard-deviation 0.01 \
--sample-psi-scale 1.0E-6 \
I know that parameter choice has a large impact upon results, and so would like to get an idea if I'm in the right ballpark, or if my settings are totally inappropriate for purpose.
Many thanks in advance,
Kevin
-
Hi Kevin Donnelly,
It definitely would be worth trying the gCNV method depending on how many targets you have. We haven't done a lot of testing running gCNV on anything more sparse than whole exome sequencing data. But it may still get good results even with a few hundred targets.
You would want to turn off the bias factors and potentially the GC correction because it would be too many parameters for too few data points. Then you will also want to test and adjust the priors such as p_alt, p_active, interval_psi_scale, potentially others.
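One way to organise the "test and adjust the priors" step is a small parameter sweep. The Python sketch below only builds the extra GermlineCNVCaller arguments for each combination; it does not run GATK. It assumes the command-line flags `--enable-bias-factors`, `--p-alt`, `--p-active` and `--interval-psi-scale` correspond to the settings named above. The helper name `prior_sweep` is hypothetical, and the candidate values are placeholders for illustration, not recommendations. GC correction is left out of the sketch.

```python
from itertools import product
from typing import Iterator, List, Sequence


def prior_sweep(
    p_alt_values: Sequence[float],
    p_active_values: Sequence[float],
    interval_psi_scales: Sequence[float],
) -> Iterator[List[str]]:
    """Yield one list of extra GermlineCNVCaller arguments per prior combination.

    Bias factors are switched off in every combination, following the
    advice for sparse target sets.
    """
    for p_alt, p_active, psi in product(
        p_alt_values, p_active_values, interval_psi_scales
    ):
        yield [
            "--enable-bias-factors", "false",
            "--p-alt", str(p_alt),
            "--p-active", str(p_active),
            "--interval-psi-scale", str(psi),
        ]


# Placeholder candidate values (assumptions for illustration only).
runs = list(prior_sweep([1e-6, 1e-4], [1e-2], [1e-3, 1e-6]))
for args in runs:
    print(" ".join(args))
```

Each printed line can be appended to an otherwise fixed GermlineCNVCaller command, so that calls from the different settings can be compared on samples with known CNVs, if any are available.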
Hope this helps you figure out the method that will work for your data! If you find any information that might help future users, please post here!
Best,
Genevieve
-
Hi Kevin Donnelly,
We are also working with gCNV on sequencing data from a targeted gene panel. We wanted to ask you, since our analyses sound very similar: did you manage to gain any insights about the parameter recommendations for this kind of data? We would very much appreciate any information you have. Thank you in advance!
Best,
Lourdes
-
Hi Lourdes Rosano,
I do apologise for the very late reply on this! We decided that it may not be the most suitable tool for our particular data, and instead opted for the R package 'panelcn.MOPS', which operates by comparing the normalised coverage for a given sample/region with a panel of controls from the same cohort:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5518446/
This ran quickly and allowed us to compare intervals of varying lengths. All the very best with your analysis.
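The general idea of comparing normalised coverage against a panel of controls can be illustrated with a short Python sketch. This is not the panelcn.MOPS algorithm, only a simplified per-region ratio under stated assumptions: read counts are normalised by each sample's total, and the reference for a region is the median of the normalised controls. All names and counts below are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def normalise(counts: Sequence[float]) -> List[float]:
    """Scale per-region read counts so that they sum to 1 for the sample."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no coverage")
    return [c / total for c in counts]


def coverage_ratios(
    sample_counts: Sequence[float],
    control_counts: Sequence[Sequence[float]],
) -> List[float]:
    """Ratio of the sample's normalised coverage to the control median, per region."""
    sample = normalise(sample_counts)
    controls = [normalise(c) for c in control_counts]
    ratios = []
    for i, value in enumerate(sample):
        # Reference coverage for region i: median across the control panel.
        reference = median(ctrl[i] for ctrl in controls)
        ratios.append(value / reference if reference > 0 else float("nan"))
    return ratios


# Hypothetical read counts for four regions of very different sizes.
controls = [
    [100, 200, 300, 400],
    [110, 190, 310, 390],
    [90, 210, 290, 410],
]
case = [100, 100, 300, 400]  # second region has about half the expected reads

ratios = coverage_ratios(case, controls)
print([round(r, 2) for r in ratios])
```

Because each region is compared only with the same region in the controls, the region length cancels out of the ratio, which is why intervals of varying lengths can be compared directly.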
Cheers,
Kevin