Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



Advice on how to split exome samples for germline CNV


3 comments

  • Laura Gauthier

    Hi Michelle Lian,

    I've been trying to track down a public version of our Python notebook for this purpose, but we don't have anything that's quite ready to publish yet. Hopefully shortly after the new year! In the meantime, I can tell you that we use a pretty basic process to do our cohort splitting. We mainly leverage methods in the scikit-learn package. We do a PCA (https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html#sphx-glr-auto-examples-decomposition-plot-pca-3d-py), typically with up to 15 components to make sure we're capturing most of the variation, but you could get away with just 2. (We also typically ask the PCA method to do the scaling for us.) Then you can run a clustering method on the PCs for all the samples. I like DBSCAN because it does outlier detection for you, but you may still need to tweak the epsilon value (we start with 0.5). We also like good ol' K-means or agglomerative/hierarchical clustering. They each have different pros and cons: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py
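
    In case it's useful, here's a minimal sketch of that workflow in scikit-learn. The counts file and matrix layout are placeholders (in practice you'd build a samples-by-intervals matrix from CollectReadCounts output), and I'm using an explicit StandardScaler step in place of letting PCA handle the scaling:

        # Hypothetical sketch: PCA + DBSCAN cohort clustering with scikit-learn.
        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.cluster import DBSCAN

        # Placeholder input: one row per sample, one column per target interval.
        counts = np.loadtxt("per_sample_interval_counts.tsv", delimiter="\t")

        # Scale, then project onto up to 15 principal components to capture
        # most of the coverage variation (2 may already be enough).
        scaled = StandardScaler().fit_transform(counts)
        pcs = PCA(n_components=15).fit_transform(scaled)

        # Cluster samples in PC space. DBSCAN labels outliers as -1 for free,
        # but epsilon usually needs tuning; 0.5 is our starting point.
        labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(pcs)

        for cluster in np.unique(labels):
            n = int(np.sum(labels == cluster))
            name = "outliers" if cluster == -1 else f"cluster {cluster}"
            print(f"{name}: {n} samples")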

    It doesn't matter which samples get assigned as case or cohort as long as they fit nicely in the same cluster. gCNV only calls rare events, so those events shouldn't be common enough to make a big impact on the cohort model. I wouldn't suggest using more than 200 samples in a cohort, but you can assign the rest as cases. 30 seems a little low, but if that's what you have then you can try it. Otherwise, putting some of the affected samples into the cohort model is fine. We try to balance XY and XX samples in the cohort. Do remove related samples if you have that information; if you don't, I've heard good things about the KING tool.
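
    Concretely, the case/cohort assignment from those cluster labels might look something like this (a toy sketch; `labels` stands in for the DBSCAN output above, and sex balancing and relatedness filtering would still happen on top of it):

        # Hypothetical sketch: split each cluster into cohort (capped at 200)
        # and case samples; DBSCAN outliers (label -1) become cases only.
        import numpy as np

        labels = np.array([0, 0, 0, 1, 1, -1])  # toy stand-in for DBSCAN labels
        MAX_COHORT = 200  # rule-of-thumb cap on cohort-model size
        rng = np.random.default_rng(0)

        cohort, cases = [], []
        for cluster in np.unique(labels):
            idx = np.flatnonzero(labels == cluster)
            if cluster == -1:
                cases.extend(idx)  # outliers: call as cases only
                continue
            chosen = rng.choice(idx, size=min(MAX_COHORT, idx.size), replace=False)
            cohort.extend(chosen)
            cases.extend(np.setdiff1d(idx, chosen))

        print(f"cohort: {len(cohort)} samples, cases: {len(cases)} samples")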

    I'm not sure that we've done a thorough comparison of different mappability tracks, but it's optional and we don't use it by default.

    Hopefully that's helpful!

  • Michelle Lian

    Hi @Laura Gauthier,

    Thank you so much for the clarifications. I'm now much clearer on what to do for questions #2-#5. As for #1, I have a few more questions on which I'd like to seek your expert insights.

    a) Considering there are 2 different capture kits, should I separate the samples into 2 groups (by capture kit) before running PCA?

    b) What matrix should I use as input to the PCA?
    For example: run CollectReadCounts over all intervals of each capture kit, compute the average coverage across all intervals per sample, and feed the result into PCA as a matrix with one row per sample and a single column for that average coverage? That way I'd have two matrices to run PCAs on, one for each capture kit and the samples sequenced with it. Or is there other matrix information I should try?

     

    Thank you again for taking the time to clarify my doubts.

  • Laura Gauthier

    I'd suggest keeping the capture kits separate, provided you have enough samples in each group for that to make sense. To get a little more detailed: there are enough "bias factors" in the model that, in theory, it should be able to represent a combination of capture kits, but that depends on how variable the coverage is. We haven't really done any experiments to show this.

    We have a specific interval file that's selected to help distinguish between capture kits in unlabeled projects, but since you know what captures you have I'd suggest just using the CollectReadCounts output over each capture's targets and doing two separate PCAs like you said.  It shouldn't be _too_ computationally intensive.  And I think there's a mini-batch PCA option in scikit-learn if scale is a problem.
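
    For what it's worth, a sketch of that two-PCA setup might look like the following. The file names and sample lists are placeholders, and it assumes TSV-format CollectReadCounts output (@-prefixed header lines followed by a table with a COUNT column):

        # Hypothetical sketch: one samples-by-intervals matrix per capture kit,
        # with a separate PCA for each.
        import numpy as np
        import pandas as pd
        from sklearn.decomposition import PCA, IncrementalPCA

        def load_counts(paths):
            """Stack per-sample COUNT columns into a samples-by-intervals matrix."""
            cols = [pd.read_csv(p, sep="\t", comment="@")["COUNT"].to_numpy()
                    for p in paths]
            return np.vstack(cols)

        # Placeholder per-kit sample lists.
        kits = {
            "kit_A": ["sampleA1.counts.tsv", "sampleA2.counts.tsv"],
            "kit_B": ["sampleB1.counts.tsv", "sampleB2.counts.tsv"],
        }

        for kit, paths in kits.items():
            matrix = load_counts(paths)
            # IncrementalPCA is scikit-learn's batched option if memory is tight.
            pca = (IncrementalPCA(n_components=2) if matrix.shape[0] > 10_000
                   else PCA(n_components=2))
            pcs = pca.fit_transform(matrix)
            print(kit, pcs.shape)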

