Steps For GATK gCNV Pipeline on Exome Data
I am a bit confused about the steps for GATK gCNV, so I am making this post to understand it. I am dealing with multiple .bams as input (so I believe I should use cohort mode). Is the following a correct set of steps for exome data:
1) PreprocessIntervals
2) AnnotateIntervals
3) FilterIntervals
4) GermlineCNVCaller
5) PostprocessGermlineCNVCalls
Or would it be better to do:
1) PreprocessIntervals
2) CollectReadCounts
3) FilterIntervals
4) GermlineCNVCaller
5) PostprocessGermlineCNVCalls
or am I reading the documentation (https://gatk.broadinstitute.org/hc/en-us/articles/360035531152) incorrectly? Is there another way to do these steps?
-
Hi Y R,
Take another look at the workflow diagram in the tutorial you linked. In cohort mode, the recommended steps would be:
1) PreprocessIntervals
2) AnnotateIntervals
3) CollectReadCounts
4) FilterIntervals
5) DetermineGermlineContigPloidy
6) GermlineCNVCaller
7) PostprocessGermlineCNVCalls
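For concreteness, here is a rough sketch of what those steps might look like on the command line. All file names are placeholders, and the argument lists are abbreviated; check each tool's documentation for the full set of options before running:

```shell
# 1) Pad the exome targets into analysis bins
gatk PreprocessIntervals \
    -R ref.fasta -L targets.interval_list \
    --bin-length 0 --interval-merging-rule OVERLAPPING_ONLY \
    -O targets.preprocessed.interval_list

# 2) Annotate bins (e.g., GC content) for filtering
gatk AnnotateIntervals \
    -R ref.fasta -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O annotated_intervals.tsv

# 3) Collect per-bin read counts (repeat once per sample)
gatk CollectReadCounts \
    -I sample1.bam -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample1.counts.hdf5

# 4) Filter bins using annotations and counts from all samples
gatk FilterIntervals \
    -L targets.preprocessed.interval_list \
    --annotated-intervals annotated_intervals.tsv \
    -I sample1.counts.hdf5 -I sample2.counts.hdf5 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O filtered.interval_list

# 5) Determine contig ploidy (requires the priors table discussed below)
gatk DetermineGermlineContigPloidy \
    -L filtered.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.counts.hdf5 -I sample2.counts.hdf5 \
    --contig-ploidy-priors ploidy_priors.tsv \
    --output-prefix ploidy -O ploidy_dir

# 6) Call CNVs in cohort mode
gatk GermlineCNVCaller --run-mode COHORT \
    -L filtered.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.counts.hdf5 -I sample2.counts.hdf5 \
    --contig-ploidy-calls ploidy_dir/ploidy-calls \
    --annotated-intervals annotated_intervals.tsv \
    --output-prefix cohort -O cnv_dir

# 7) Post-process into per-sample VCFs (run once per sample index)
gatk PostprocessGermlineCNVCalls \
    --calls-shard-path cnv_dir/cohort-calls \
    --model-shard-path cnv_dir/cohort-model \
    --contig-ploidy-calls ploidy_dir/ploidy-calls \
    --sample-index 0 \
    --output-genotyped-intervals sample1.genotyped_intervals.vcf.gz \
    --output-genotyped-segments sample1.genotyped_segments.vcf.gz
```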
You may want to read Sec. 1 of that tutorial to understand the differences when running this workflow for exomes or genomes. Essentially, you must provide targets (which will be padded to create the corresponding genomic bins for analysis) when running on exomes, whereas you simply select the bin size (which is used to create bins that tile the entire genome) when running on genomes.
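To illustrate that difference, the two cases might look something like this at the PreprocessIntervals step (file names and the specific padding/bin-length values are illustrative; the tutorial uses similar settings):

```shell
# Exome: provide the target list; targets are padded, no uniform binning
gatk PreprocessIntervals \
    -R ref.fasta -L targets.interval_list \
    --bin-length 0 --padding 250 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O exome.preprocessed.interval_list

# Genome: no target list; fixed-size bins tile the whole genome
gatk PreprocessIntervals \
    -R ref.fasta \
    --bin-length 1000 --padding 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O wgs.preprocessed.interval_list
```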
-
Hi Samuel Lee, I am quite confused by the documentation, so I am thankful for your response about the recommended steps. But what if I don't have a table for the ploidy step?
-
If this table cannot be easily generated/automated in a pipeline, can the `DetermineGermlineContigPloidy` step be skipped?
-
You cannot skip this step. The contig-ploidy priors table is simply a resource file that is provided to the tool. You should only need to construct this file a single time; there is no need to generate it automatically. You should be able to find some additional pointers on this table if you search the forum.
-
I did look on the forum, but it only shows examples, not how to construct a file specific to the batches you are running. What exactly should I do?
-
Your results should hopefully not be too sensitive to the particular values used in this table—they are just priors. Typically you would not need to construct a new table for each batch unless your data happens to be very idiosyncratic.
I would suggest trying to use a table constructed with values similar to those found in the tutorial or other forum posts. Then you should run the tool on your data and inspect the results to see if they are reasonable (e.g., CN = 2 on the majority of autosomal chromosomes, and reasonably inferred sex genotypes).
Hopefully the tutorial and forum will be good resources should you require additional assistance!
-
So then you mean use it like this:
CONTIG_NAME  PLOIDY_PRIOR_0  PLOIDY_PRIOR_1  PLOIDY_PRIOR_2  PLOIDY_PRIOR_3
1            0.01            0.01            0.97            0.01
2            0.01            0.01            0.97            0.01
3            0.01            0.01            0.97            0.01
4            0.01            0.01            0.97            0.01
5            0.01            0.01            0.97            0.01
6            0.01            0.01            0.97            0.01
7            0.01            0.01            0.97            0.01
8            0.01            0.01            0.97            0.01
9            0.01            0.01            0.97            0.01
10           0.01            0.01            0.97            0.01
11           0.01            0.01            0.97            0.01
12           0.01            0.01            0.97            0.01
13           0.01            0.01            0.97            0.01
14           0.01            0.01            0.97            0.01
15           0.01            0.01            0.97            0.01
16           0.01            0.01            0.97            0.01
17           0.01            0.01            0.97            0.01
18           0.01            0.01            0.97            0.01
19           0.01            0.01            0.97            0.01
20           0.01            0.01            0.97            0.01
21           0.01            0.01            0.97            0.01
22           0.01            0.01            0.97            0.01
X            0.01            0.49            0.49            0.01
Y            0.50            0.50            0.00            0.00
1) Is the above correct?
2) What if the results are sensitive to the particular values used in the table?
-
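Since the table is so regular, it could also be generated with a small script rather than typed by hand. The helper below is hypothetical (the function name, output path, and prior values are just illustrative), but it writes a TSV in the shape shown above:

```python
# Hypothetical helper to write a contig-ploidy priors TSV like the one
# shown above. The prior values here are illustrative defaults, not
# prescriptive; adjust them if your cohort warrants it.
def write_ploidy_priors(path, autosomes=range(1, 23)):
    header = ("CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1"
              "\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3")
    rows = [header]
    for contig in autosomes:
        # Autosomes: strongly favor copy number 2
        rows.append(f"{contig}\t0.01\t0.01\t0.97\t0.01")
    # Allosomes: split prior mass between the two plausible ploidies
    rows.append("X\t0.01\t0.49\t0.49\t0.01")
    rows.append("Y\t0.50\t0.50\t0.00\t0.00")
    with open(path, "w") as f:
        f.write("\n".join(rows) + "\n")

write_ploidy_priors("ploidy_priors.tsv")
```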
Yes, that should be fine for an initial run.
As a general principle, if your analysis is sensitive to your priors, then your data is not very informative.
-
What do you mean by "not very informative"? Also, does it only accept .tsv files?
-
Yes, this file must be in TSV format.
I was speaking very generally about Bayesian inference (https://en.wikipedia.org/wiki/Bayesian_inference). If your inferences depend very sensitively on the particulars of your prior, then it's possible that your data are not informative enough to override the prior.
The point here is that it is very unlikely that your data are poor enough that e.g. choosing 0.98 instead of 0.97 in the table above would cause you to make incorrect inferences about ploidy.
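A toy calculation can make this concrete. The model below is NOT GATK's actual model; it simply assumes per-bin read counts on a contig are Poisson with mean proportional to copy number, to show that a small change in the prior (0.97 vs 0.98) barely moves the posterior when the data are reasonably deep:

```python
# Toy illustration of "informative data overwhelms the prior".
# Assumes Poisson read counts with mean = ploidy * depth_per_copy.
import math

def ploidy_posterior(counts, prior, depth_per_copy=100.0):
    """Posterior over ploidy states 0..len(prior)-1 given per-bin counts."""
    logpost = []
    for k, p in enumerate(prior):
        lam = max(k * depth_per_copy, 1e-6)  # tiny mean stands in for ploidy 0
        loglik = sum(c * math.log(lam) - lam - math.lgamma(c + 1)
                     for c in counts)
        logpost.append((math.log(p) if p > 0 else float("-inf")) + loglik)
    # Normalize in log space for numerical stability
    m = max(logpost)
    weights = [math.exp(x - m) for x in logpost]
    total = sum(weights)
    return [w / total for w in weights]

counts = [190, 210, 205, 198]  # ~2 copies at ~100x coverage per copy
for prior in ([0.01, 0.01, 0.97, 0.01], [0.005, 0.005, 0.98, 0.01]):
    post = ploidy_posterior(counts, prior)
    print([round(p, 4) for p in post])  # P(ploidy = 2) is ~1 either way
```

With either prior the posterior concentrates almost entirely on ploidy 2, which is the sense in which the exact table values should not matter much.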
I would encourage you to take some time to go over the tutorial and to experiment with the tool. This forum might not be the appropriate venue for such detailed support.
-
Wait, according to the workflow diagram, can I actually do the following? It looks like I would have to do either AnnotateIntervals or CollectReadCounts, but not both:
1) PreprocessIntervals
2) AnnotateIntervals
3) CollectReadCounts
4) FilterIntervals
5) DetermineGermlineContigPloidy
6) GermlineCNVCaller
7) PostprocessGermlineCNVCalls
-
Y R you might find the corresponding WDL workflow helpful: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl The diagram is just an overview, but the WDL is an actual implementation and should be more clear. We highly recommend the AnnotateIntervals step, but technically it is optional.