Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Wrong allosomal ploidies from DetermineGermlineContigPloidy.

0

11 comments

  • Avatar
    Samuel Lee

    Hi Weichi,

    Apologies that you are running into difficulties, but thank you for this very clear and informative report!

    The DetermineGermlineContigPloidy tool attempts to fit a very simple model that is based on the total number of read counts in each contig per sample. (Note that a table of these counts is actually output as an intermediate, temporary file by the tool, which you might be able to grab and analyze.) So our goal will be to get the model to differentiate between CN=2 female samples and CN=1 male samples, essentially using the total number of counts in the bins on chrX (while also accounting for the overall depth and the autosomal contig-level CNs for each sample). (Your case is a little unusual in that you have no data on chrY, but I think the model should still be able to perform inference correctly.)

    We thus have two options: 1) filter or preprocess the count data so that it is better accomodated by the model, or 2) adjust the model hyperparameters to encourage the desired fit.

    Unfortunately, I cannot give you detailed recommendations without actually seeing your data. But I might suggest that you use your python analysis scripts to look at these total contig-level counts, if you have not already. Ideally, if you plot a histogram of the total counts on chrX roughly normalized by sample-level depth (which you can hopefully estimate from the count data as well), you should see two clear components corresponding to male and female.

    If not, you might want to try to understand if filtering some of the intervals (e.g., using FilterIntervals, as you mentioned) might provide a path forward; for example, although it seems unlikely, one might imagine a case in which your chrX bins are dominated by very noisy intervals that might be throwing off the total counts and smearing out these two components. It might also be worth checking the behavior of the intervals in the PAR to make sure nothing looks suspicious.

    If this approach does not work, then I would start to explore adjusting hyperparameters. I would probably start with the global-psi-scale hyperparameter (which essentially sets the scale of contig-level variance we expect in the counts), but again, it's difficult to say for sure if this will help without seeing your data.

    0
    Comment actions Permalink
  • Avatar
    Weichi

    Hi Samuel Lee,

    Thank you for your response.
    I obtained a temporary TSV file, named "samples-by-coverage-per-contig", from DetermineGermlineContigPloidy and used Excel to create an XY plot.
    The coverage values for the chrX were obtained directly from the TSV file. However, I am unsure whether I should normalize these values by the coverage of the autosomal chromosomes.
    Additionally, upon visualizing the data using the XY plot, I observed two distinct clusters that correspond to the male and female samples.
    However, the results of the ploidy in chrX from DetermineGermlineContigPloidy remained 2 for all samples.
     
    Furthermore, I noticed that the PLOIDY_GQ of the X chromosome is lower in males than in females.
    The PLOIDY_GQ of chrX
    Female max:10.57, min:9.20; Male max:8.31, min:6.23

    I have attached three plots that illustrate the results of my analysis.
    The first and second plot is generated by Excel with the data from the TSV file.
     
    The third plot is the observation that I mentioned before.
    The original count values were generated using GATK4 CollectCount and then transformed into z-scores. This was done by first computing the z-score for each sample and then computing z-scores for each interval.
     

     

    0
    Comment actions Permalink
  • Avatar
    Samuel Lee

    Thanks Weichi! That data looks pretty clean, so I think we should be able to get a good result. Perhaps some hyperparameters might need to be adjusted?

    Would you mind sharing that samples-by-coverage-per-contig TSV so I can take a direct look? You can remove any sample IDs, if necessary, just the count values would be fine.

    0
    Comment actions Permalink
  • Avatar
    Weichi

    Hi Samuel Lee

    Sure, here is the samples-by-coverage-per-contig TSV file: https://drive.google.com/file/d/1wK1a-6YiDIOEzxP0dqJ4PMXwuCkw0FvP/view?usp=share_link 

    0
    Comment actions Permalink
  • Avatar
    Samuel Lee

    Thanks Weichi! And oops, one more thing—can you share your intervals file?

    0
    Comment actions Permalink
  • Avatar
    Weichi

    Hi, Samuel Lee,

    Here is the link:https://drive.google.com/drive/folders/1-zT-GFeNlJs_zg8Uve5e50uKYo2ema2M?usp=share_link 

    It contains samples-by-coverage-per-contig TSV file, intervals_list files, and contig_ploidy_priors TSV file.

    Thanks for your kind support.

    0
    Comment actions Permalink
  • Avatar
    Weichi

    I have another question: Would the length of the interval affect the sensitivity of CNV calling?

    The intervals I mentioned before are my target bed file. After the PreprocessIntervals step, it creates an interval_list file (what I provided) that has 6321 intervals ranging in length from 221 bp to 8037 bp (median: 719 bp, mean:727 bp). I am wondering whether I should split the interval that is higher than 1000bp (because the recommended bin size in WGS is 1000bp).

    0
    Comment actions Permalink
  • Avatar
    Weichi

    As I am unsure how to tune the parameters, here are some initial tests.
    Except for the chrX in 0.0001(male and female got different results), all samples got the same result in each global-psi-scale test.

    Would you please let me know where I could find some reference material to learn about how the global-psi-scale works?

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    Samuel Lee

    Hi Weichi,

    Thanks for these additional questions and test results! Apologies for the delay, I took some vacation over the past few days but will return to looking at this later today.

    The distribution of interval length typically should not strongly affect this ploidy-level modeling step, but may affect your ability to resolve breakpoints in the subsequent GermlineCNVCaller modeling step. You might indeed want to break up your interval list---it's possible that using the bin-length option in the PreprocessIntervals tool could help you here.

    For all the details on the gCNV model and its parameters, you may want to refer to the Online Methods section of the preprint at https://www.biorxiv.org/content/10.1101/2022.08.25.504851v1.full.pdf.

    0
    Comment actions Permalink
  • Avatar
    Samuel Lee

    Hi Weichi

    I think that if you further crank global-psi-scale down you should start to get the desired results. For example, I set it to 1E-6 (which is probably extreme), and this yielded ploidy = 1 calls on chrX without inducing aneuploidies elsewhere.

    As you can see from the paper, this parameter controls the amount of "unexplained variance" or "overdispersion" that we allow from contig-level variance in the count data. That is, when this parameter is small, we are more likely to think that fluctuations in the count data come from real CNV signal---e.g., from copy number due to differing sex genotypes---and not unexplained systematics. However, when this parameter is large, it then tends to absorb any such fluctuations, causing the model to prefer copy-number states according to the prior. You can thus think of this as a parameter that controls the sensitivity of the model.

    0
    Comment actions Permalink
  • Avatar
    Weichi

    Hi, Samuel Lee,

    Thank you for your kind support. I also tried to set it to 2E-6 and obtained the same result as you. However, when I set it to 3E-6, it yielded ploidy=3 calls on male chrX. 
    I learned a lot through this discussion, thanks.

     

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk