GermlineCNVCaller - Cohort mode - sample size recommendation.
GATK version used: 4.4.0.0
gatk GermlineCNVCaller \
--run-mode COHORT
Hi there,
I'm looking for a recommendation on the number of samples to use when building the COHORT model for GermlineCNVCaller.
I see in the documentation a recommendation of 200 samples, for resource usage reasons, but I was wondering whether using more than 200 samples would produce a better COHORT model, both for later CASE runs and for CNV calling on the initial COHORT samples. Would more samples always be better, or is there a number beyond which additional samples provide no further benefit to the COHORT model?
Hope you can help.
Eddie
-
Hi Eddie Ip
Although having infinitely many samples to build a model would be the ideal, there are diminishing returns past a certain number of samples, especially given the resources needed to complete the model. In my own experience, you don't see any added benefit beyond around 200 samples. I tried this with a clinical exome kit that I use regularly. After about 180 samples had been added to the model, my case results did not show any additional true positives, nor did they eliminate any false positives or false negatives.
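For anyone who wants to look for this kind of plateau in their own data, here is a minimal sketch of one way to do it. It is not part of GATK. The function names, the event IDs, and the idea of storing calls as sets keyed by cohort size are all assumptions made for illustration. It compares the calls made on a fixed set of truth samples against confirmed events at each cohort size, then reports the smallest cohort size from which the true positive, false positive, and false negative counts stop changing.

```python
def confusion_counts(calls, truth):
    """Return (TP, FP, FN) for a set of called events against confirmed events."""
    tp = len(calls & truth)   # called and confirmed
    fp = len(calls - truth)   # called but not confirmed
    fn = len(truth - calls)   # confirmed but not called
    return tp, fp, fn


def plateau_size(calls_by_size, truth):
    """Smallest cohort size from which the counts match the largest cohort.

    calls_by_size: dict mapping cohort size -> set of event IDs called in the
                   truth samples with a model built from that many samples
                   (hypothetical layout, for illustration only).
    truth:         set of orthogonally confirmed event IDs.
    """
    sizes = sorted(calls_by_size)
    counts = [confusion_counts(calls_by_size[s], truth) for s in sizes]
    final = counts[-1]
    plateau = sizes[-1]
    # Walk back from the largest cohort while the counts stay identical.
    for size, count in zip(reversed(sizes), reversed(counts)):
        if count != final:
            break
        plateau = size
    return plateau
```

If the function returns the largest cohort size you tested, the counts were still changing at the last step, so no plateau has been observed yet. Matching on counts alone is a simplification. Comparing the actual sets of events would be stricter.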
Regards.
-
Thanks for the information SkyWarrior.
-
Hi @SkyWarrior! I'm currently working on determining the appropriate number of samples to use for model generation, and I have a pool of over 300 WES samples to choose from.
After reading your comment about using a clinical exome kit and reaching a point where additional samples didn't improve the results, I'm curious about how you validated your models, and determined the number of true positives and false negatives. Did you have access to a publicly available or in-house truth set for comparison? I would greatly appreciate it if you could provide more detail on your methodology.
Thanks!
-
Hi Ram
All my samples are in-house clinical samples validated using other orthogonal methods. Whenever new members are added to the cohort, I re-check the results for all cohort samples and look at how the calls change in my known truth samples. There are about 60 of those (the number grows over time as more confirmations are done), already confirmed using MLPA, array CGH, etc. Above 200 samples I did not observe any additional positive calls or missing calls in those samples, so I think my model has pretty much reached a plateau. I also QC-check my samples and try not to include ones that deviate too much in terms of depth, AT/GC dropout, and percentage of zero-coverage targets. Those outliers usually shift your model from a nice convergence to absolute chaos.
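The QC step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure used here. The metric keys (`mean_depth`, `at_dropout`, `gc_dropout`, `pct_zero_cvg`) are hypothetical names for per-sample values you would collect yourself, for example from a hybrid-selection metrics tool. The outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) is a swapped-in, commonly used heuristic and not something GATK prescribes.

```python
from statistics import median

# Hypothetical metric names; map them to whatever your QC tool reports.
METRICS = ("mean_depth", "at_dropout", "gc_dropout", "pct_zero_cvg")


def modified_z_scores(values):
    """Modified z-scores based on the median and median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to scale by, so this metric flags nothing.
        return [0.0] * len(values)
    return [0.6745 * (v - med) / mad for v in values]


def split_cohort(samples, threshold=3.5):
    """Split samples into (kept, rejected) lists of sample names.

    samples: dict mapping sample name -> dict of metric name -> float.
    A sample is rejected if any metric is an outlier relative to the cohort.
    """
    names = sorted(samples)
    rejected = set()
    for metric in METRICS:
        scores = modified_z_scores([samples[n][metric] for n in names])
        for name, score in zip(names, scores):
            if abs(score) > threshold:
                rejected.add(name)
    kept = [n for n in names if n not in rejected]
    return kept, sorted(rejected)
```

Only the `kept` samples would then go into the COHORT run. The threshold is a starting point and worth tuning against your own kit and sequencing setup.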
I hope this helps.
-
Thanks SkyWarrior! That was indeed very helpful!