Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Steps For GATK-gCNV Pipeline on Exome Data

12 comments

  • Samuel Lee

    Hi Y R,

    Take another look at the workflow diagram in the tutorial you linked. In cohort mode, the recommended steps would be:

    1) PreprocessIntervals

    2) AnnotateIntervals

    3) CollectReadCounts

    4) FilterIntervals

    5) DetermineGermlineContigPloidy

    6) GermlineCNVCaller

    7) PostprocessGermlineCNVCalls

    You may want to read Sec. 1 of that tutorial to understand the differences when running this workflow for exomes or genomes. Essentially, you must provide targets (which will be padded to create the corresponding genomic bins for analysis) when running on exomes, whereas you simply select the bin size (which is used to create bins that tile the entire genome) when running on genomes.
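    The first three steps above can be sketched as GATK command lines for the exome case. This is only a sketch: file names such as ref.fasta, targets.interval_list, and sample1.bam are placeholders, and the padding and bin-length values follow the tutorial's suggestions.

```shell
# 1) Pad the exome targets to create the genomic bins for analysis.
#    --bin-length 0 disables uniform genome-wide binning; --padding 250
#    is the tutorial's suggested padding for exome targets.
gatk PreprocessIntervals \
    -R ref.fasta \
    -L targets.interval_list \
    --bin-length 0 \
    --padding 250 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O targets.preprocessed.interval_list
# For genomes you would instead omit -L and simply pick a bin size,
# e.g. --bin-length 1000 --padding 0, to tile the entire genome.

# 2) Annotate the bins (e.g. with GC content); recommended but optional.
gatk AnnotateIntervals \
    -R ref.fasta \
    -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O targets.annotated.tsv

# 3) Collect per-bin read counts for each sample in the cohort.
gatk CollectReadCounts \
    -R ref.fasta \
    -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.bam \
    -O sample1.counts.hdf5
```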

  • Y R

    Hi Samuel Lee, I am quite confused by the documentation, so I am really thankful for your response about the recommended steps. But what if I don't have a table for the ploidy step?

  • Y R

    If this table cannot be easily generated/automated in a pipeline, can the `DetermineGermlineContigPloidy` step be avoided?

  • Samuel Lee

    You cannot skip this step. The contig-ploidy priors table is simply a resource file that is provided to the tool. You should only need to construct it once, so there is no need to generate it automatically. You should be able to find some additional pointers on this table if you search the forum.

  • Y R

    I did look on the forum, but it only shows examples, not how to construct a file specific to the batches you are running. What exactly should I do?

  • Samuel Lee

    Your results should hopefully not be too sensitive to the particular values used in this table; they are just priors. Typically you would not need to construct a new table for each batch unless your data happens to be very idiosyncratic.

    I would suggest trying to use a table constructed with values similar to those found in the tutorial or other forum posts. Then you should run the tool on your data and inspect the results to see if they are reasonable (e.g., CN = 2 on the majority of autosomal chromosomes, and reasonably inferred sex genotypes).

    Hopefully the tutorial and forum will be good resources should you require additional assistance!
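    A hypothetical invocation of the ploidy step, assuming placeholder file names and output prefixes (the filtered interval list and counts files come from the earlier steps):

```shell
# Run DetermineGermlineContigPloidy once over the cohort, passing the
# contig-ploidy priors table as a resource file.
gatk DetermineGermlineContigPloidy \
    -L cohort.filtered.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.counts.hdf5 \
    -I sample2.counts.hdf5 \
    --contig-ploidy-priors contig_ploidy_priors.tsv \
    --output ploidy_out \
    --output-prefix cohort

# Sanity checks on the output: CN = 2 on the majority of autosomes,
# and reasonably inferred sex genotypes in ploidy_out/cohort-calls/.
```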

  • Y R

    So then you mean use it like this:

    CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
    1 0.01 0.01 0.97 0.01
    2 0.01 0.01 0.97 0.01
    3 0.01 0.01 0.97 0.01
    4 0.01 0.01 0.97 0.01
    5 0.01 0.01 0.97 0.01
    6 0.01 0.01 0.97 0.01
    7 0.01 0.01 0.97 0.01
    8 0.01 0.01 0.97 0.01
    9 0.01 0.01 0.97 0.01
    10 0.01 0.01 0.97 0.01
    11 0.01 0.01 0.97 0.01
    12 0.01 0.01 0.97 0.01
    13 0.01 0.01 0.97 0.01
    14 0.01 0.01 0.97 0.01
    15 0.01 0.01 0.97 0.01
    16 0.01 0.01 0.97 0.01
    17 0.01 0.01 0.97 0.01
    18 0.01 0.01 0.97 0.01
    19 0.01 0.01 0.97 0.01
    20 0.01 0.01 0.97 0.01
    21 0.01 0.01 0.97 0.01
    22 0.01 0.01 0.97 0.01
    X 0.01 0.49 0.49 0.01
    Y 0.50 0.50 0.00 0.00

    1) Is the above correct?
    2) What if the results are sensitive to the particular values used in the table?
  • Samuel Lee

    Yes, that should be fine for an initial run.

    As a general principle, if your analysis is sensitive to your priors, then your data is not very informative.

  • Y R

    What do you mean by "not very informative"? Does it only accept .tsv files?

  • Samuel Lee

    Yes, this file must be in TSV format.

    I was speaking very generally about Bayesian inference (https://en.wikipedia.org/wiki/Bayesian_inference). If your inferences depend very sensitively on the particulars of your prior, then it's possible that your data are not informative enough to override the prior.

    The point here is that it is very unlikely that your data are poor enough that e.g. choosing 0.98 instead of 0.97 in the table above would cause you to make incorrect inferences about ploidy.

    I would encourage you to take some time to go over the tutorial and to experiment with the tool. This forum might not be the appropriate venue for such detailed support.
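    Since the file must be tab-separated, one way to generate the exact table posted above (the output file name is arbitrary) is a small shell script:

```shell
# Write the contig-ploidy priors table as a tab-separated file.
# Using printf guarantees real tab characters, which copy-pasting
# a table from a forum post can silently lose.
out=contig_ploidy_priors.tsv
printf 'CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3\n' > "$out"
for contig in $(seq 1 22); do                     # autosomes: most mass on CN = 2
    printf '%s\t0.01\t0.01\t0.97\t0.01\n' "$contig" >> "$out"
done
printf 'X\t0.01\t0.49\t0.49\t0.01\n' >> "$out"    # X: one or two copies equally likely
printf 'Y\t0.50\t0.50\t0.00\t0.00\n' >> "$out"    # Y: zero or one copy
```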

  • Y R

    Wait, according to the workflow diagram I can't do the following, since I would have to do either AnnotateIntervals or CollectReadCounts, but not both (according to the diagram)?

    1) PreprocessIntervals

    2) AnnotateIntervals

    3) CollectReadCounts

    4) FilterIntervals

    5) DetermineGermlineContigPloidy

    6) GermlineCNVCaller

    7) PostprocessGermlineCNVCalls

  • Laura Gauthier

    Y R, you might find the corresponding WDL workflow helpful: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl The diagram is just an overview, but the WDL is an actual implementation and should be clearer. We highly recommend the AnnotateIntervals step, but technically it is optional.
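    A sketch of how the AnnotateIntervals output plugs into the downstream steps (placeholder file names; the --annotated-intervals argument is what the "recommended but optional" status refers to, and can be dropped from both commands):

```shell
# 4) Filter bins using the annotations plus the cohort's read counts.
gatk FilterIntervals \
    -L targets.preprocessed.interval_list \
    --annotated-intervals targets.annotated.tsv \
    -I sample1.counts.hdf5 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O cohort.filtered.interval_list

# 6) Call CNVs in cohort mode, reusing the same annotations for
#    explicit denoising against covariates such as GC content.
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L cohort.filtered.interval_list \
    --annotated-intervals targets.annotated.tsv \
    -I sample1.counts.hdf5 \
    --contig-ploidy-calls ploidy_out/cohort-calls \
    --output cnv_out \
    --output-prefix cohort
```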

