Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Calculation method of the global_read_depth.tsv file of the DetermineGermlineContigPloidy tool

Answered · 7 comments

  • Genevieve Brandt (she/her)

    Hi Tintest,

    Yes, these values represent different things. CollectReadCounts outputs the raw integer counts of reads overlapping each interval. DetermineGermlineContigPloidy calls contig-level ploidies and also generates global noise data.

    The final gCNV result comes from the output of GermlineCNVCaller followed by post-processing. Are you following this tutorial? The gCNV method is quite complex and we are continuing to improve it. Please let us know if your final results are not consistent with what you expect, and we may be able to look into whether there is a bug. The calculation is much more involved than averaging the counts from the CollectReadCounts tool.

    Hope this helps!

    Genevieve

  • Tintest

    Hello Genevieve,

    I followed the tutorial, I am very satisfied with how the gCNV calling pipeline works, and my results are very interesting. I would just like more detail on how the global_read_depth.tsv file is generated, and more particularly its GLOBAL_READ_DEPTH column.

    According to the tutorial, GLOBAL_READ_DEPTH is the average depth over all the intervals of the calling target (see picture), so I expected to recover this value by averaging the CollectReadCounts values obtained over the calling target. However, this is not the case. Should it be, or is this value calculated differently?
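
    For reference, this is roughly the naive average I computed (a sketch only: the file name is a placeholder, and the column layout assumes the standard CollectReadCounts TSV output):

    ```python
    import pandas as pd

    # Naive estimate: mean of the COUNT column over the calling intervals.
    # "sample.counts.tsv" is a placeholder; CollectReadCounts writes an
    # @-prefixed header followed by CONTIG, START, END, COUNT columns.
    counts = pd.read_csv("sample.counts.tsv", sep="\t", comment="@")
    naive_depth = counts["COUNT"].mean()
    print(naive_depth)  # does not match GLOBAL_READ_DEPTH in global_read_depth.tsv
    ```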

    Thank you, regards.

  • Samuel Lee

    Hi Tintest,

    Thanks for your question. The depth parameter is defined so that the expected number of reads in each bin is, roughly, depth x systematic bias in the bin x copy number in the bin. This parameter corresponds to the quantity `d_s` in the white paper here: https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf 

    I would expect the depth parameter to be roughly (the average number of reads per bin) / ploidy, although the distribution of the per-bin biases will ultimately determine this relation. (I'm not sure if this squares with the quantities you describe; perhaps I'm misunderstanding something, but it seems like you have described the same quantity in two slightly different ways?)
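
    As a toy numerical illustration (not GATK code; all of the numbers below are arbitrary assumptions), the relation looks roughly like this:

    ```python
    import numpy as np

    # With mean(per-bin bias) ~= 1, (average reads per bin) / ploidy ~= depth;
    # a skewed bias distribution will shift this relation.
    rng = np.random.default_rng(0)

    depth = 40.0                                                  # d_s for one sample (assumed value)
    ploidy = 2                                                    # autosomal copy number
    bias = rng.gamma(shape=50.0, scale=1.0 / 50.0, size=100_000)  # per-bin bias, mean ~1

    expected_reads = depth * bias * ploidy                        # expected reads per bin
    print(expected_reads.mean() / ploidy)                         # ~40, i.e. close to `depth` here
    ```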

    Hope that clears things up!

  • Tintest

    Hello,

    Thank you very much for your explanations. I have one last question. Is it possible to use this value to measure differences in read depth between different samples?

    For example, I built two different models from two cohorts that were generated with two different wet-lab protocols. Rather than comparing coverage or read depth on the BAM files, could I plot the GLOBAL_READ_DEPTH distribution of all the samples in each cohort to see which cohort is more deeply sequenced over the regions useful for CNV calling, while taking the capture bias into account?

    Thank you, regards.

     

  • Samuel Lee

    Hi Tintest,

    Yes, this parameter is basically meant to represent overall depth. If the model fits are good for both cohorts, you can reasonably compare the distributions of this parameter across them. The depth alone won't tell the whole story (things like the relative variability of the capture bias will also affect call quality), but it's a good sanity check nonetheless.
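
    For example, one way to do that comparison in practice might look like the sketch below (the directory layout and file names are assumptions about where each sample's DetermineGermlineContigPloidy output ends up; GLOBAL_READ_DEPTH is the column discussed above):

    ```python
    import glob

    import pandas as pd

    # Collect the GLOBAL_READ_DEPTH value from each sample's global_read_depth.tsv.
    def collect_global_depths(pattern):
        values = []
        for path in glob.glob(pattern):
            df = pd.read_csv(path, sep="\t", comment="@")
            values.append(float(df["GLOBAL_READ_DEPTH"].iloc[0]))
        return pd.Series(values)

    # Hypothetical per-cohort directory layout with one output folder per sample.
    cohort_a = collect_global_depths("cohort_A/*/global_read_depth.tsv")
    cohort_b = collect_global_depths("cohort_B/*/global_read_depth.tsv")

    # Compare the two distributions (summary statistics; a histogram works too).
    print(cohort_a.describe())
    print(cohort_b.describe())
    ```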

  • Tintest

    Hello,

    Thank you for your answer. I have one last thing to ask. I tried to implement strategies to compare the distributions of different WES datasets and to deduce which model would be more appropriate for which type of data, but I didn't get anything conclusive. Do you have any tips or examples of things you have implemented for this purpose?

    Regards.

  • Samuel Lee

    Hi Tintest,

    If I understand correctly, you want to determine which of the two models should be used with incoming samples in case mode? One approach we've used in the past is a simple "preclustering" step: run PCA, in conjunction with a suitable clustering algorithm, on the read counts (or even just on a subset of the bins), making sure the counts are suitably centered/normalized for PCA; typically, taking a log transform and then centering on per-bin and per-sample medians gives reasonable results. This usually suffices to roughly separate samples into clusters for which separate gCNV models can be fit. Incoming samples can then be PCA-transformed and clustered in the same way, and the appropriate model can be used to call them.
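
    A rough sketch of that preclustering idea, with illustrative function names and parameter choices (this is not an official GATK workflow):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # `counts` is assumed to be a samples-by-bins matrix of read counts over a
    # shared interval list.
    def normalize(counts, bin_medians=None):
        x = np.log1p(np.asarray(counts, dtype=float))   # log transform
        if bin_medians is None:
            bin_medians = np.median(x, axis=0)          # per-bin medians from the cohort
        x = x - bin_medians                             # center each bin
        x = x - np.median(x, axis=1, keepdims=True)     # center each sample
        return x, bin_medians

    def fit_preclustering(counts, n_components=10, n_clusters=2):
        x, bin_medians = normalize(counts)
        pca = PCA(n_components=n_components).fit(x)
        kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(pca.transform(x))
        return pca, kmeans, bin_medians

    def assign_cluster(new_counts, pca, kmeans, bin_medians):
        # Incoming case samples get the same transform, then are called with the
        # gCNV model that was trained on their assigned cluster.
        x, _ = normalize(new_counts, bin_medians)
        return kmeans.predict(pca.transform(x))
    ```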

    This procedure may need to be customized depending on the characteristics of your samples, target kits, etc., so we don't provide a standard workflow for it. But hopefully this rough description is enough to get you started down the right path!

