CallCopyRatioSegments -- how does it work?
AnsweredHello GATK team,
Thank you for making this great toolset. I am trying to figure out how the thresholds reported by the GATK tool CallCopyRatioSegments are used for actually calling the segments as neutral, amplified, or deleted. I am hoping to re-use the mean and standard deviation reported to determine what threshold should be used on a per sample basis when I convert the centered copy ratios reported at the segment level to the chromosome arm level. However, when I go to the documentation for this tool on the GATK website it only states that it is a modified version of the ReCapSeg method. The current version of the documentation no longer links to this method (older versions have a link but that link is broken) and it is not described on the CallCopyRatioSegments page. I tried to figure out how this function works by looking at the code for this tool on Github, but I don't know Java. From the what I understand of the logs and the code it is calling another script SimpleCopyRatioCaller.java which converts the input centered, logarithmic-scaled copy ratio (CR) for each segment back to the uncentered, linear copy number and then does some filtering in-part based on --neutral-segment-copy-ratio-lower-bound 0.9 and --neutral-segment-copy-ratio-upper-bound 1.1 flags provided while calculating and reporting the length weighted means and standard deviations of the remaining copy numbers which it will use to find the z-score of the copy-ratios for subsequent. I have included an example of these reported values and the corresponding output file below. As a sanity check I tried determining a call using the reported mean and standard devation to calcluate a z-score for the uncentered segment_mean for some of the segments and then use the tools z-score cutoff of 2. When I looked at these results however I noticed that the very first segment chr1:925942-1282783 in the file has a segment mean of 0.110692 and call of 0 (neutral). When I uncenter the value by adding 1 and then calculate ((0.110692+1)-1.010396)/0.025906 = 3.871 I immediately found that the z-score was greater than 2 which by my understanding should result in a call of "+" (amplified). Could you clarify how the values reported from the log file are being used to determine the Call for a segment?
Thank you!
I'm not sure it matters but I am using GATK 4.1.9.0
21:54:24.480 INFO CallCopyRatioSegments - Initializing engine
21:54:24.480 INFO CallCopyRatioSegments - Done initializing engine
21:54:24.961 INFO SimpleCopyRatioCaller - 38 segments in copy-neutral region [0.900000, 1.100000]...
21:54:24.969 INFO SimpleCopyRatioCaller - Length-weighted mean of segments in copy-neutral region (CR space): 1.005269
21:54:24.969 INFO SimpleCopyRatioCaller - Length-weighted standard deviation for segments in copy-neutral region : 0.033740
21:54:24.970 INFO SimpleCopyRatioCaller - 33 / 38 segments in copy-neutral region remain after outliers filtered using z-score threshold (2.000000)...
21:54:24.971 INFO SimpleCopyRatioCaller - Length-weighted mean for z-score calling (CR space): 1.010396
21:54:24.971 INFO SimpleCopyRatioCaller - Length-weighted standard deviation for z-score calling (CR space): 0.025906
Sample Chromosome Start End Num_Probes Call Segment_Mean
Sample1.T chr1 925942 12827830 1865 0 0.110692
Sample1.T chr1 12828497 12893474 9 - -1.307025
Sample1.T chr1 12894586 62663130 5719 0 0.070876
Sample1.T chr1 62688227 62805255 3 - -10.176457
Sample1.T chr1 62816575 143964979 3320 0 0.003106
Sample1.T chr1 144955439 156949107 2091 + 0.59806
Sample1.T chr1 156951573 248918363 6698 + 0.457546
Sample1.T chr2 41608 10443313 388 + 0.215446
Sample1.T chr2 10443490 68173417 3364 0 0.022339
Sample1.T chr2 68174735 68174802 1 - -29.884054
Sample1.T chr2 68180963 216058976 8413 0 0.004328
Sample1.T chr2 216065312 241899330 2430 0 0.073499
Sample1.T chr3 319777 48467600 3136 0 0.018004
Sample1.T chr3 48468147 49104733 343 - -0.360815
Sample1.T chr3 49108412 49375590 107 - -1.0313
Sample1.T chr3 49412427 63863790 1754 - -0.326337
Sample1.T chr3 63912599 198169852 6346 0 -0.010716
Sample1.T chr4 85731 145155996 6082 0 -0.033225
Sample1.T chr4 145159137 145165197 3 - -10.377916
Sample1.T chr4 145174661 150852943 197 0 -0.104607
Sample1.T chr4 150867671 150870606 3 - -10.002192
Sample1.T chr4 150871345 190027204 1325 0 -0.052795
Sample1.T chr5 140308 45645608 1756 + 0.375799
Sample1.T chr5 45695669 181260855 6891 0 -0.111016
Sample1.T chr6 292540 170584581 9651 0 -0.047762
Sample1.T chr7 193200 159144771 8912 0 0.026317
Sample1.T chr8 166086 116942701 4845 0 0.022325
Sample1.T chr8 117135328 145054157 1787 + 0.616072
Sample1.T chr9 14807 14823327 545 - -0.178025
Sample1.T chr9 14824025 138221328 7353 0 0.056903
Sample1.T chr10 47057 133626742 8034 - -0.212251
Sample1.T chr11 193712 4947100 960 - -0.773528
Sample1.T chr11 4954772 4955713 1 - -4.373585
Sample1.T chr11 4988212 9809002 615 - -0.928281
Sample1.T chr11 9812532 10564374 85 + 1.207107
Sample1.T chr11 10568448 30233800 1034 + 0.541924
Sample1.T chr11 30330949 61248535 1738 0 -0.000249
Sample1.T chr11 61249669 95831256 3947 0 0.065086
Sample1.T chr11 95835290 95923954 17 + 1.935455
Sample1.T chr11 95978948 134387847 2728 0 0.017333
Sample1.T chr12 66883 133234356 11121 0 0.01278
Sample1.T chr13 19173863 48412423 1648 + 0.556307
Sample1.T chr13 48452993 48477404 8 - -10.00926
Sample1.T chr13 48479998 48499760 4 - -4.852542
Sample1.T chr13 48501742 49175541 30 - -4.374675
Sample1.T chr13 49178568 49178654 1 - -29.834243
Sample1.T chr13 49185964 49186102 1 - -2.84414
Sample1.T chr13 49187122 49187190 1 - -29.738695
Sample1.T chr13 49188515 49278519 18 - -4.663135
Sample1.T chr13 49280509 49286282 2 - -16.796562
Sample1.T chr13 49291684 49460232 8 - -3.810535
Sample1.T chr13 49461097 49467960 2 - -16.527831
Sample1.T chr13 49476476 49491849 11 - -5.860275
Sample1.T chr13 49496002 108306938 1192 + 0.556199
Sample1.T chr13 108665886 114326281 529 + 0.98517
Sample1.T chr14 18601117 105529794 5959 0 0.039669
Sample1.T chr15 20534259 29570723 335 - -0.877508
Sample1.T chr15 29701595 30635630 62 - -0.162893
Sample1.T chr15 30904664 32394233 82 - -0.825191
Sample1.T chr15 32394665 32398849 5 - -2.553281
Sample1.T chr15 32446474 40030456 411 - -0.855683
Sample1.T chr15 40032169 40032801 2 - -16.310428
Sample1.T chr15 40034326 45682661 1360 - -0.814669
Sample1.T chr15 45688337 45688404 1 - -3.854157
Sample1.T chr15 45689039 58138676 894 - -0.902228
Sample1.T chr15 58166673 58887540 43 - -1.380785
Sample1.T chr15 58887560 62656086 232 - -0.876344
Sample1.T chr15 62657771 62673890 2 - -14.986952
Sample1.T chr15 62675217 70984257 901 - -0.88104
Sample1.T chr15 71008379 101923059 2360 0 0.030271
Sample1.T chr16 47430 14765413 2164 - -0.801695
Sample1.T chr16 14838407 15009759 33 0 -0.060814
Sample1.T chr16 15016129 19503617 360 - -0.846888
Sample1.T chr16 19504881 19507779 2 - -15.301485
Sample1.T chr16 19510839 28878789 972 - -0.864876
Sample1.T chr16 28879099 28879116 1 - -29.938505
Sample1.T chr16 28879501 68691926 2756 - -0.817006
Sample1.T chr16 68695255 68815759 13 - -1.899663
Sample1.T chr16 68819280 90075910 1910 - -0.777819
Sample1.T chr17 156220 21551450 3568 - -0.813481
Sample1.T chr17 27294396 36166563 1202 0 0.083687
Sample1.T chr17 36169080 36212569 9 - -1.229
Sample1.T chr17 36253954 83094575 6781 0 0.081737
Sample1.T chr18 158699 8343520 414 + 0.484465
Sample1.T chr18 8370890 10855566 165 0 -0.019479
Sample1.T chr18 10857001 12099682 44 - -0.913736
Sample1.T chr18 12102706 12512531 54 0 0.063171
Sample1.T chr18 12535475 14808579 141 - -0.884909
Sample1.T chr18 14831383 14840678 4 - -2.23181
Sample1.T chr18 14843024 24022847 272 0 0.063143
Sample1.T chr18 24064140 57731807 1240 - -0.902233
Sample1.T chr18 58044661 62560644 197 0 0.04459
Sample1.T chr18 62563009 63643577 74 + 0.498528
Sample1.T chr18 63655657 68691361 76 - -0.614551
Sample1.T chr18 68697226 68897428 13 - -1.015552
Sample1.T chr18 68897578 75288641 153 0 -0.015056
Sample1.T chr18 75410739 80247348 151 + 0.551337
Sample1.T chr19 281388 8512711 2343 - -0.749298
Sample1.T chr19 8513029 9536356 186 + 0.537415
Sample1.T chr19 9565940 10561845 341 + 1.046876
Sample1.T chr19 10562729 10797518 80 + 1.651495
Sample1.T chr19 10798486 11540021 290 0 0.135459
Sample1.T chr19 11541076 12731421 180 - -0.188717
Sample1.T chr19 12734313 13145395 186 - -0.526045
Sample1.T chr19 13149029 19713089 1687 0 0.113179
Sample1.T chr19 19714091 29526665 132 - -0.251964
Sample1.T chr19 29527211 43195273 2384 0 0.081799
Sample1.T chr19 43197997 43259135 5 - -25.601576
Sample1.T chr19 43261860 54240689 2528 0 0.096304
Sample1.T chr19 54240944 54241868 2 - -3.255788
Sample1.T chr19 54242829 58572632 983 0 0.056674
Sample1.T chr20 87710 64273600 4869 0 0.031419
Sample1.T chr21 10413667 46664374 1896 0 0.009138
Sample1.T chr22 15528159 50782310 4155 0 0.074184
-
Apologies for the confusing behavior and scant (or missing, in the case of ReCapSeg) documentation. I think the bit of code at https://github.com/broadinstitute/gatk/blob/9d5727df8db3a475b1ba5f9bff6bc92a322f5633/src/main/java/org/broadinstitute/hellbender/tools/copynumber/caller/SimpleCopyRatioCaller.java#L67 will answer your question. That first segment is considered neutral simply because it lies within the copy-neutral region defined by [0.9, 1.1] (with copy ratio 2^0.110692 ~ 1.08).
In order for a segment to be considered duplicated or deleted, it must lie outside of this region AND the z-score thresholds; all other segments are then considered neutral. It's been (quite) a while since I developed this tool, but I believe we were trying to replicate the spirit of the original ReCapSeg calculation (which was itself relatively naive), while at the same time making changes to the method to avoid undesired behavior in some cases.Just for historical interest, this tool was intended to be a sort of a placeholder until a more sophisticated method that also incorporates the allele-fraction data (which is not used here, as is alluded to in the tool documentation) was developed to replace it. Unfortunately, our development roadmap changed, and although various members of our group have experimented with prototypes, we were never able to fully develop a replacement method to our satisfaction.
-
Hi Samuel Lee,
Thank you for your response. It is very helpful and I now realize I previously was using the reported segment means improperly. That block is helpful. Just one further point of clarification. In the case of my example data above the z-score thresholds [1.010396 - 2*0.025906, 1.010396 - 2*0.025906] are within the bounds of the of the defined copy-neutral space [0.9,1.1]. In such a case the it appears that the z-score thresholds do not matter as they are only used to evaluate segments outside of the defined copy-neutral space. Is that interpretation correct?
Thanks!
-
Yes, that's correct that the z-score thresholds do not come into play. You might want to adjust the parameters to be more sensitive. If I recall, these default parameters were tuned for a clinical pipeline and are correspondingly conservative.
However, note that z-score thresholding of segments within these bounds does come into play when calculating the "Length-weighted mean for z-score calling (CR space)." The gist is that outliers in the distribution of segment means in the copy-neutral region would otherwise affect the calculation of this length-weighted mean, were z-score thresholding not performed. It may be helpful to read each line of the log output to understand the corresponding step of the calling method.
Please sign in to leave a comment.
3 comments