Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GermlineCNVCaller - Cohort mode - sample size recommendation.

5 comments

  • SkyWarrior

    Hi Eddie Ip

    Although having infinitely many samples to build a model is the ultimate goal, there are diminishing returns after a certain number of samples is reached, especially given the amount of resources needed to complete the model. My personal experience also indicates that beyond around 200 samples you don't observe any added benefit. I tried this with a clinical exome kit that I was using regularly: after about 180 samples were added to the model, my case results did not seem to gain any additional true positives or eliminate any false positives or negatives.

    Regards. 

  • Eddie Ip

    Thanks for the information, SkyWarrior.

  • Ram

    Hi @SkyWarrior! I'm currently working on determining the appropriate number of samples to use for model generation, and I have a pool of over 300 WES samples to choose from.

    After reading your comment about using a clinical exome kit and reaching a point where additional samples didn't improve the results, I'm curious how you validated your models and determined the number of true positives and false negatives. Did you have access to a publicly available or in-house truth set for comparison? I would greatly appreciate it if you could provide more detail on your methodology.

    Thanks!

  • SkyWarrior

    Hi Ram

    All my samples are in-house clinical samples validated using other orthogonal methods. I routinely check the results of all the cohort samples once new members are added, and I watch for changes in the calls across all of the known truth samples that I have, which are about 60 (a number that keeps growing as more confirmations are done), each already confirmed using MLPA, ArrayCGH, etc. Above 200 samples I did not observe any additional positive calls or missing calls in those samples, so I think my model has pretty much reached a plateau. I also QC-check my samples and try not to include ones that deviate too much in terms of depth, AT/GC dropout, and zero-coverage target percentage; those outliers usually cause the model to shift from a nice convergence to absolute chaos. (Rough sketches of both checks follow this thread.)

    I hope this helps. 

  • Ram

    Thanks SkyWarrior! That was indeed very helpful!

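The truth-set check described in the thread can be approximated with a small script. The sketch below is only illustrative and is not taken from the thread: it assumes calls and confirmed events are available as simple (chrom, start, end, type) intervals, and it counts a confirmed event as recovered at 50% or more reciprocal overlap. Both the representation and the threshold are arbitrary choices, and the data in the `__main__` block is hypothetical.

```python
"""Toy truth-set concordance check: after adding new samples to the
cohort model, re-call the confirmed truth samples and see whether any
confirmed event is gained or lost."""

from typing import NamedTuple


class CnvEvent(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "DEL" or "DUP"


def reciprocal_overlap(a: CnvEvent, b: CnvEvent) -> float:
    """Fraction of the larger event covered by the intersection
    (equivalent to the smaller of the two mutual overlap fractions)."""
    inter = min(a.end, b.end) - max(a.start, b.start)
    if a.chrom != b.chrom or inter <= 0:
        return 0.0
    return inter / max(a.end - a.start, b.end - b.start)


def recovered(truth: list[CnvEvent], calls: list[CnvEvent],
              min_ro: float = 0.5) -> tuple[int, int]:
    """Return (true positives, false negatives) for one truth sample."""
    tp = sum(
        any(c.kind == t.kind and reciprocal_overlap(t, c) >= min_ro for c in calls)
        for t in truth
    )
    return tp, len(truth) - tp


if __name__ == "__main__":
    # Hypothetical example: one confirmed deletion, recovered by the calls.
    truth = [CnvEvent("chr2", 100_000, 250_000, "DEL")]
    calls = [CnvEvent("chr2", 95_000, 240_000, "DEL"),
             CnvEvent("chr7", 1_000_000, 1_050_000, "DUP")]
    print(recovered(truth, calls))  # -> (1, 0)
```

Re-running a check like this on the truth samples after each batch of new cohort members is added, and watching whether the true positive and false negative counts move, is one way to see the plateau mentioned above around 200 samples.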
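The depth, AT/GC dropout, and zero-coverage screening can likewise be sketched. The example below assumes one Picard CollectHsMetrics output file per candidate sample and uses its MEAN_TARGET_COVERAGE, AT_DROPOUT, GC_DROPOUT, and ZERO_CVG_TARGETS_PCT columns; the file-naming pattern and every threshold are placeholders rather than recommended cut-offs, and none of them come from the thread itself.

```python
#!/usr/bin/env python3
"""Toy pre-cohort QC screen: keep only candidate samples whose
hybrid-selection metrics pass simple thresholds before they are added
to the cohort model."""

import sys
from pathlib import Path

# Placeholder thresholds; tune against your own kit and batch.
MIN_MEAN_TARGET_COVERAGE = 80.0    # mean fold coverage on target
MAX_AT_DROPOUT = 10.0              # percent
MAX_GC_DROPOUT = 10.0              # percent
MAX_ZERO_CVG_TARGETS_PCT = 0.02    # targets with no coverage
                                   # (check the units your Picard version reports)


def read_hs_metrics(path: Path) -> dict:
    """Return the first metrics row of a CollectHsMetrics file as a dict."""
    rows = [line for line in path.read_text().splitlines()
            if line.strip() and not line.startswith("#")]
    header, values = rows[0].split("\t"), rows[1].split("\t")
    return dict(zip(header, values))


def passes_qc(m: dict) -> bool:
    """Apply the placeholder thresholds to one sample's metrics."""
    return (float(m["MEAN_TARGET_COVERAGE"]) >= MIN_MEAN_TARGET_COVERAGE
            and float(m["AT_DROPOUT"]) <= MAX_AT_DROPOUT
            and float(m["GC_DROPOUT"]) <= MAX_GC_DROPOUT
            and float(m["ZERO_CVG_TARGETS_PCT"]) <= MAX_ZERO_CVG_TARGETS_PCT)


def main(metrics_dir: str) -> None:
    """Print KEEP/DROP for every candidate sample in the directory."""
    for path in sorted(Path(metrics_dir).glob("*.hs_metrics.txt")):
        sample = path.name[: -len(".hs_metrics.txt")]
        verdict = "KEEP" if passes_qc(read_hs_metrics(path)) else "DROP"
        print(f"{verdict}\t{sample}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

In practice a relative rule, such as dropping samples that sit several median absolute deviations away from the batch median for each metric, may fit better than fixed thresholds; the constants above exist only to make the sketch runnable.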
