Wrong allosomal ploidies from DetermineGermlineContigPloidy.
REQUIRED for all errors and issues:
a) GATK version used: 4.2.0.0
b) Exact command used:
gatk --java-options "-Xmx6g" DetermineGermlineContigPloidy \
-L intervals \
--input ~{sep=" --input " readCountFiles} \
--contig-ploidy-priors ~{contigPloidyPriors} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "out" \
--output-prefix ~{cohortEntityID} \
--verbosity DEBUG \
`# all parameters set default here` \
--mean-bias-standard-deviation ~{default="0.01" mean_bias_standard_deviation} \
--mapping-error-rate ~{default="0.01" mapping_error_rate} \
--global-psi-scale ~{default="0.001" global_psi_scale} \
--sample-psi-scale ~{default="0.0001" sample_psi_scale}
c) Entire program log: NA
I am currently adapting the gCNV pipeline to my target panel.
The test dataset includes 86 samples comprising 46 males and 40 females.
However, I have encountered an issue with the DetermineGermlineContigPloidy tool, as it fails to call allosomal ploidies accurately.
Specifically, the tool assigns female ploidy to all samples.
Upon analyzing the read count files generated by GATK4 CollectCount, I can use the python package seaborn.clustermap to identify two distinct clusters corresponding to males and females. (The read counts were first transformed into z-scores before analysis.)
My question is whether there are any parameters that can be adjusted in DetermineGermlineContigPloidy to improve its accuracy in calling allosomal ploidies?
My target panel comprises 1.9M in size, with 6973 intervals ranging in length from 220 bp to 7694 bp (median: 237 bp, mean:276 bp). The panel includes all chromosomes from chr1 to chrX, excluding chrY and chrM. On chrX, there are 698 intervals(totaling 195109 bp), including 8 located within PAR1 (totaling 2417 bp). And I didn't remove any intervals by FilterIntervals.
-
Hi Weichi,
Apologies that you are running into difficulties, but thank you for this very clear and informative report!
The DetermineGermlineContigPloidy tool attempts to fit a very simple model that is based on the total number of read counts in each contig per sample. (Note that a table of these counts is actually output as an intermediate, temporary file by the tool, which you might be able to grab and analyze.) So our goal will be to get the model to differentiate between CN=2 female samples and CN=1 male samples, essentially using the total number of counts in the bins on chrX (while also accounting for the overall depth and the autosomal contig-level CNs for each sample). (Your case is a little unusual in that you have no data on chrY, but I think the model should still be able to perform inference correctly.)
We thus have two options: 1) filter or preprocess the count data so that it is better accomodated by the model, or 2) adjust the model hyperparameters to encourage the desired fit.
Unfortunately, I cannot give you detailed recommendations without actually seeing your data. But I might suggest that you use your python analysis scripts to look at these total contig-level counts, if you have not already. Ideally, if you plot a histogram of the total counts on chrX roughly normalized by sample-level depth (which you can hopefully estimate from the count data as well), you should see two clear components corresponding to male and female.
If not, you might want to try to understand if filtering some of the intervals (e.g., using FilterIntervals, as you mentioned) might provide a path forward; for example, although it seems unlikely, one might imagine a case in which your chrX bins are dominated by very noisy intervals that might be throwing off the total counts and smearing out these two components. It might also be worth checking the behavior of the intervals in the PAR to make sure nothing looks suspicious.
If this approach does not work, then I would start to explore adjusting hyperparameters. I would probably start with the global-psi-scale hyperparameter (which essentially sets the scale of contig-level variance we expect in the counts), but again, it's difficult to say for sure if this will help without seeing your data.
-
Hi Samuel Lee,
Thank you for your response.I obtained a temporary TSV file, named "samples-by-coverage-per-contig", from DetermineGermlineContigPloidy and used Excel to create an XY plot.The coverage values for the chrX were obtained directly from the TSV file. However, I am unsure whether I should normalize these values by the coverage of the autosomal chromosomes.Additionally, upon visualizing the data using the XY plot, I observed two distinct clusters that correspond to the male and female samples.However, the results of the ploidy in chrX from DetermineGermlineContigPloidy remained 2 for all samples.Furthermore, I noticed that the PLOIDY_GQ of the X chromosome is lower in males than in females.
The PLOIDY_GQ of chrXFemale max:10.57, min:9.20; Male max:8.31, min:6.23
I have attached three plots that illustrate the results of my analysis.The first and second plot is generated by Excel with the data from the TSV file.The third plot is the observation that I mentioned before.The original count values were generated using GATK4 CollectCount and then transformed into z-scores. This was done by first computing the z-score for each sample and then computing z-scores for each interval. -
Thanks Weichi! That data looks pretty clean, so I think we should be able to get a good result. Perhaps some hyperparameters might need to be adjusted?
Would you mind sharing that samples-by-coverage-per-contig TSV so I can take a direct look? You can remove any sample IDs, if necessary, just the count values would be fine.
-
Hi Samuel Lee
Sure, here is the samples-by-coverage-per-contig TSV file: https://drive.google.com/file/d/1wK1a-6YiDIOEzxP0dqJ4PMXwuCkw0FvP/view?usp=share_link
-
Thanks Weichi! And oops, one more thing—can you share your intervals file?
-
Hi, Samuel Lee,
Here is the link:https://drive.google.com/drive/folders/1-zT-GFeNlJs_zg8Uve5e50uKYo2ema2M?usp=share_link
It contains samples-by-coverage-per-contig TSV file, intervals_list files, and contig_ploidy_priors TSV file.
Thanks for your kind support.
-
I have another question: Would the length of the interval affect the sensitivity of CNV calling?
The intervals I mentioned before are my target bed file. After the PreprocessIntervals step, it creates an interval_list file (what I provided) that has 6321 intervals ranging in length from 221 bp to 8037 bp (median: 719 bp, mean:727 bp). I am wondering whether I should split the interval that is higher than 1000bp (because the recommended bin size in WGS is 1000bp).
-
As I am unsure how to tune the parameters, here are some initial tests.
Except for the chrX in 0.0001(male and female got different results), all samples got the same result in each global-psi-scale test.Would you please let me know where I could find some reference material to learn about how the global-psi-scale works?
Thanks!
-
Hi Weichi,
Thanks for these additional questions and test results! Apologies for the delay, I took some vacation over the past few days but will return to looking at this later today.
The distribution of interval length typically should not strongly affect this ploidy-level modeling step, but may affect your ability to resolve breakpoints in the subsequent GermlineCNVCaller modeling step. You might indeed want to break up your interval list---it's possible that using the bin-length option in the PreprocessIntervals tool could help you here.
For all the details on the gCNV model and its parameters, you may want to refer to the Online Methods section of the preprint at https://www.biorxiv.org/content/10.1101/2022.08.25.504851v1.full.pdf.
-
Hi Weichi
I think that if you further crank global-psi-scale down you should start to get the desired results. For example, I set it to 1E-6 (which is probably extreme), and this yielded ploidy = 1 calls on chrX without inducing aneuploidies elsewhere.
As you can see from the paper, this parameter controls the amount of "unexplained variance" or "overdispersion" that we allow from contig-level variance in the count data. That is, when this parameter is small, we are more likely to think that fluctuations in the count data come from real CNV signal---e.g., from copy number due to differing sex genotypes---and not unexplained systematics. However, when this parameter is large, it then tends to absorb any such fluctuations, causing the model to prefer copy-number states according to the prior. You can thus think of this as a parameter that controls the sensitivity of the model.
-
Hi, Samuel Lee,
Thank you for your kind support. I also tried to set it to 2E-6 and obtained the same result as you. However, when I set it to 3E-6, it yielded ploidy=3 calls on male chrX.
I learned a lot through this discussion, thanks.
Please sign in to leave a comment.
11 comments