Calculation method of the global_read_depth.tsv file of the DetermineGermlineContigPloidy tool
Hello,
I'm writing because I'd like to learn more about how the global_read_depth.tsv file produced by the DetermineGermlineContigPloidy tool is calculated, and in particular its GLOBAL_READ_DEPTH column.
Indeed, I tried to reproduce this value by averaging the per-interval counts from the CollectReadCounts tool, but the average of the CollectReadCounts counts comes out roughly 2 times higher than GLOBAL_READ_DEPTH.
Do these two values represent different things?
Thank you, regards.
-
Hi Tintest,
Yes, these values represent different things. CollectReadCounts reports the raw integer counts of the reads overlapping each interval. DetermineGermlineContigPloidy calls contig-level ploidies and also generates global noise data.
The final gCNV result comes from the output of GermlineCNVCaller followed by post-processing. Are you following this tutorial? The gCNV method is quite complex and we are continuing to improve it; it is much more involved than averaging the counts from the CollectReadCounts tool. Please let us know if your final results are not consistent with what you expect, and we may be able to look into whether there is some sort of bug.
Hope this helps!
Genevieve
-
Hello Genevieve,
I followed the tutorial and I am very satisfied with how the gCNV calling pipeline works; my results are very interesting. I would just like more detail on how the global_read_depth.tsv file, and more particularly its GLOBAL_READ_DEPTH column, is computed.
According to the tutorial, GLOBAL_READ_DEPTH is the average depth over all the intervals of the calling target (see picture), so by averaging my CollectReadCounts values over the calling target I expected to recover this value. However, this is not the case. Should it be, or is this value calculated differently?
Thank you, regards.
-
Hi Tintest,
Thanks for your question. The depth parameter is defined so that the expected number of reads in each bin is, roughly, depth x systematic bias in the bin x copy number in the bin. This parameter corresponds to the quantity `d_s` in the white paper here: https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf
I would expect the depth parameter to be roughly the average number of reads in each bin / ploidy, although the distribution of the per-bin biases will ultimately determine this relation. (Not sure if this squares with the quantities you describe; perhaps I'm misunderstanding something, but it seems like you have described the same quantity in two slightly different ways?)
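To make that relation concrete, here is a small simulation sketch (not GATK code; the depth value, bias distribution, and bin count are all made up for illustration) showing why averaging the per-bin counts and dividing by ploidy only approximately recovers the depth parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration of the relation described above:
# expected reads per bin ~ depth (d_s) x per-bin bias x copy number.
depth = 50.0      # the sample's global read depth (made up)
n_bins = 10_000
bias = rng.gamma(shape=20.0, scale=1.0 / 20.0, size=n_bins)  # mean ~1
copy_number = np.full(n_bins, 2)  # diploid autosomal bins

expected = depth * bias * copy_number
counts = rng.poisson(expected)

# Average count per bin divided by ploidy recovers the depth only
# approximately; the spread of the bias distribution (and how close
# its mean is to 1) determines how tight the relation is.
recovered = counts.mean() / 2
print(depth, recovered)
```

With a bias distribution whose mean drifts away from 1 (as can happen with a real capture), the recovered value moves away from the true depth, which is consistent with the discrepancy described in the question.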
Hope that clears things up!
-
Hello,
Thank you very much for your explanations. I have one last question. Is it possible to use this value to measure differences in read depth between samples?
For example, I created two different models with two cohorts generated using two different wet-lab protocols. Rather than comparing coverage or read depth on the BAM files, can I plot the GLOBAL_READ_DEPTH distribution of all the samples from each cohort to see which one is more deeply sequenced over the regions useful for CNV calling, taking the capture bias into account?
Thank you, regards.
-
Hi Tintest,
Yes, this parameter is basically supposed to represent overall depth. If your model fits are both good across the two cohorts, you could reasonably compare the distributions of this parameter across them. The depth alone won't tell the whole story (as things like the relative degree of variability in the capture bias will also affect the quality of the calls), but it's good to sanity check it nonetheless.
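A minimal sketch of that comparison, assuming each sample's global_read_depth.tsv has a tab-separated header row containing a GLOBAL_READ_DEPTH column (the exact layout here is an assumption; check it against the files your own run produced):

```python
import numpy as np

def read_global_depth(tsv_text):
    """Return GLOBAL_READ_DEPTH from one sample's TSV contents.

    Lines starting with '@' are skipped in case the file carries
    SAM-style header lines, as some GATK TSV outputs do.
    """
    lines = [l for l in tsv_text.splitlines() if l and not l.startswith("@")]
    header = lines[0].split("\t")
    row = lines[1].split("\t")
    return float(row[header.index("GLOBAL_READ_DEPTH")])

# Hypothetical file contents standing in for two cohorts' samples:
cohort_a = ["GLOBAL_READ_DEPTH\tAVERAGE_PLOIDY\n48.2\t2.0",
            "GLOBAL_READ_DEPTH\tAVERAGE_PLOIDY\n51.7\t2.0"]
cohort_b = ["GLOBAL_READ_DEPTH\tAVERAGE_PLOIDY\n30.1\t2.0",
            "GLOBAL_READ_DEPTH\tAVERAGE_PLOIDY\n28.9\t2.0"]

depths_a = np.array([read_global_depth(t) for t in cohort_a])
depths_b = np.array([read_global_depth(t) for t in cohort_b])

# Compare the two distributions (summary stats, or plot histograms).
print(depths_a.mean(), depths_b.mean())
```

In practice you would glob the per-sample output directories, read each file from disk, and plot the two distributions as histograms or boxplots rather than just comparing means.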
-
Hello,
Thank you for your answer. I have one last thing to ask you. I tried to implement strategies to compare the distributions of different WES datasets and deduce which model would be most appropriate for which type of data, but I didn't get anything conclusive. Do you have any tips or examples of things you've implemented for this purpose?
Regards.
-
Hi Tintest,
If I understand correctly, you want to determine which of the two models should be used with incoming samples in case mode? One approach we've used in the past is a simple "preclustering" step: run PCA, in conjunction with a suitable clustering algorithm, on the read counts (or even just on a subset of the bins). Make sure the counts are suitably centered/normalized as appropriate for PCA; typically, taking a log transform and then centering on per-bin and per-sample medians gives reasonable results. This usually suffices to roughly separate samples into clusters for which separate gCNV models can be fit. Incoming samples can then be PCA-transformed and clustered in the same way, and the appropriate model used to call them.
This procedure may need to be customized depending on the characteristics of your samples, target kits, etc., so we don't provide a standard workflow for it. But hopefully this rough description is enough to get you started down the right path!
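As a starting point, here is a toy sketch of that preclustering recipe on synthetic data (everything here is made up: the two simulated capture profiles stand in for real CollectReadCounts matrices, and a simple split on the first principal component stands in for a proper clustering algorithm such as k-means):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate two cohorts of 15 samples with different capture profiles;
# rows are samples, columns are bins (as from CollectReadCounts).
n_bins = 200
bias_a = rng.gamma(20, 1 / 20, n_bins)
bias_b = bias_a * rng.gamma(5, 1 / 5, n_bins)  # distorted second profile
counts = np.vstack(
    [rng.poisson(100 * bias_a, n_bins) for _ in range(15)]
    + [rng.poisson(100 * bias_b, n_bins) for _ in range(15)]
)

# Log transform, then center on per-bin and per-sample medians,
# as suggested above.
x = np.log1p(counts.astype(float))
x = x - np.median(x, axis=0)                  # per-bin median
x = x - np.median(x, axis=1, keepdims=True)   # per-sample median

# PCA via SVD; a median split on PC1 is a toy stand-in for running a
# real clustering algorithm over the leading components.
u, s, vt = np.linalg.svd(x, full_matrices=False)
pc1 = u[:, 0] * s[0]
labels = (pc1 > np.median(pc1)).astype(int)

print(labels)  # samples should separate by cohort along PC1
```

Incoming samples would then be projected onto the stored principal axes (`vt`) and assigned to the nearest cluster, and called with that cluster's model.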