ModelSegments output from Somatic CNV
Dear GATK forum,
I have performed somatic CNV analysis on coverage data with ModelSegments (GATK 4.4.0.0, hg38) using the regular parameters. The output from this run is shown in figure 1, which shows hypersegmentation; some segments have a positive log2 copy ratio while others are negative.
gatk --java-options "-Xmx7g" ModelSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--output sandbox \
--output-prefix Tumor_model_segment
So I changed some of the ModelSegments parameters, as shown below. The hypersegmentation is now resolved, but the output has only a few segments and all of them are deletions, i.e. every segment's MEAN_LOG2_COPY_RATIO is negative. How is it possible that all of them are negative after changing the parameters?
Please suggest.
gatk --java-options "-Xmx7g" ModelSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--number-of-changepoints-penalty-factor 5.0 \
--kernel-variance-copy-ratio 5.0 \
--kernel-variance-allele-fraction 5.0 \
--output sandbox \
--output-prefix model_segment_output
-
Below is our former documentation for somatic CNV calls. Some parts of our tooling may have changed, but the main ideas still hold, and you will find recommendations from our developers for the parameters you are adjusting.
Regards.
-
Thank you for your reply. I looked into these recommendations, mainly section 8.2, but none of the plots show up in that tutorial in either Chrome or Edge.
Please suggest how I can view the plots to see the effect of the parameter tuning.
-
We are currently working on restoring the missing images in those articles. We are sorry for the inconvenience.
Here are those plots from section 8.2:
**8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.**
**8I. Increasing kernel variance parameters each to 0.8 gives 144 segments.** Changing `--kernel-variance-copy-ratio` alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.
**8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments.** Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.
**8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.**
-
Thanks for the plots. I was able to get a decent number of segments by adjusting the kernel-variance-copy-ratio parameter.
But after this run, every "MEAN_LOG2_COPY_RATIO" value in the segments file is negative; none of them are positive after adjusting kernel-variance-copy-ratio. I was expecting some of the segments to be positive. Kindly suggest.
-
I have contacted our team to get a more definitive answer for your case, which will arrive soon. In the meantime, I have a few suggestions in response to this issue.
Hypersegmentation could also be due to the nature of your sample. I have seen such behavior even with germline samples because of the AT/GC dropout levels of the data coming off the sequencer. The higher the dropout levels, the worse the segmentation gets; trying to smooth out this kind of data may lower the overall copy-number states of every segment. Below is such an example image from a germline sample of mine.
Due to the nature of the sample received from the sequencer, germline CNV analysis resulted in extreme segmentation towards the negative end of the spectrum.
We recommend checking the overall metrics of your sequencing experiment to see whether you have any outlier samples or unusual metric values.
We will be back with more expert views shortly.
Regards.
-
Thank you so much for the detailed reply; I will wait for the further information you mentioned above.
Best
-
Hi Indrani Datta,
A few operational questions:
1. What kind of data are these? WGS? Targeted? If targeted, do you provide an interval list? What coverage are these samples (mean coverage for WGS, mean coverage over targets for targeted sequencing)?
2. Do you run a matched normal sample as well? A panel of normals?
3. Did you run CollectAllelicCounts on the sample? Providing an allelic counts file to ModelSegments usually helps.
A few suggestions:
1. Run the PlotDenoisedCopyRatios GATK tool and look at the resulting MAD metrics (specifically denoised_MAD_value_tumor); they should be small (see the sketch after these suggestions).
2. Take a look at our GATK best practices WDL pipeline for somatic CNV analysis here. Specifically, the ModelSegments step contains several suggested default values; check those for some starting values. You can also use this pipeline to assess whether you might benefit from some of the other pre- or post-processing steps in the pipeline, or run it yourself in Cromshell, a free tool.
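A minimal sketch of that plotting step, assuming you still have the standardized and denoised copy ratio TSVs from DenoiseReadCounts and the .dict file matching your reference (file names here are placeholders):
gatk PlotDenoisedCopyRatios \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--sequence-dictionary hg38.dict \
--output sandbox/plots \
--output-prefix tumor
The <prefix>.denoisedMAD.txt file in the output directory should contain the denoised MAD value.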
If none of this works, maybe you could provide your denoisedCR.tsv to us and we can dig into it.
-
Thank you for your reply.
These are whole-exome data and Q20 was above 96%. GATK somatic CNV was run with a matched normal. ModelSegments was executed without allelic counts. I hope this answers your questions.
I will look into the PlotDenoisedCopyRatios tool and will get back here after that.
-
Sounds good. Can you verify whether you provided an interval list? I believe the tool should raise an error if one wasn't provided, but if not, that could explain the high variance.
If you'd like to try providing allelic counts, those are calculated with the CollectAllelicCounts GATK tool. I've never run the command without allelic counts so I have no basis for comparison, but this may improve segmentation.
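If you want to try it, the commands are roughly as follows (a sketch with placeholder file names; the -L input should be an interval list of common germline SNP sites):
# collect ref/alt counts at common het sites for tumor and matched normal
gatk CollectAllelicCounts \
-I tumor.bam \
-R hg38.fasta \
-L common_snp_sites.interval_list \
-O tumor.allelicCounts.tsv
gatk CollectAllelicCounts \
-I normal.bam \
-R hg38.fasta \
-L common_snp_sites.interval_list \
-O normal.allelicCounts.tsv
# then pass both to ModelSegments alongside the denoised copy ratios
gatk --java-options "-Xmx7g" ModelSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--allelic-counts tumor.allelicCounts.tsv \
--normal-allelic-counts normal.allelicCounts.tsv \
--output sandbox \
--output-prefix Tumor_model_segment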
-
I have used "reference_hg38_bge_exome_calling_regions.v1.1.interval_list" for the interval list.
Is it possible to call CallCopyRatioSegments after adding CollectAllelicCounts? I thought CallCopyRatioSegments works only after a coverage-based run, i.e. without allelic counts.
Thank you.
-
Indira, good question... the allelic counts go into ModelSegments, which uses them to help with the segmentation. It outputs a copy ratio and an allele fraction per segment, and CallCopyRatioSegments uses only the copy ratio values to make its best guess at thresholds for amplifications and deletions.
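For example, once ModelSegments has produced its .cr.seg copy-ratio segments file, the calling step is just (a minimal sketch; file names assume your earlier output prefix):
gatk CallCopyRatioSegments \
--input sandbox/Tumor_model_segment.cr.seg \
--output sandbox/Tumor_model_segment.called.seg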
Basically, the allelic counts are used for the segmentation but not for the calls.
-
I also might have missed your answer to this, but did you use a panel of normals for these data?
-
No, I don't have a panel of normals; I used a matched normal.
I also looked into the standardizedMAD.txt file and the value is 0.3. Does this look fine?
Thanks.
-
It is recommended that you use both a panel of normals and a matched normal, but I'm not sure that would explain the anomalies. It could explain the negative copy ratios though: is your mean target coverage lower for the tumor than for the matched normal?
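For reference, a panel of normals is typically built from the read counts of several normal samples and then supplied to DenoiseReadCounts; a rough sketch with placeholder file names:
# build the PoN from read counts of several normals
gatk CreateReadCountPanelOfNormals \
-I normal1.counts.hdf5 \
-I normal2.counts.hdf5 \
-I normal3.counts.hdf5 \
-O cnv.pon.hdf5
# denoise the tumor read counts against the PoN
gatk DenoiseReadCounts \
-I tumor.counts.hdf5 \
--count-panel-of-normals cnv.pon.hdf5 \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv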
Edit: that MAD value is on the higher end but in the range of values I've seen on clinical data, particularly for saliva samples.
-
Q20 is slightly higher in the matched normal than in the tumor. This is not a clinical sample, though.
Thanks
-
I'm looking into whether that can shift the baseline.
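One quick way to check, assuming the standard denoisedCR.tsv layout (SAM-style '@' header lines followed by a CONTIG/START/END/LOG2_COPY_RATIO table), would be something like:
awk -F'\t' '!/^@/ && $1 != "CONTIG" { sum += $4; n++ } END { print sum / n }' tumor.denoisedCR.tsv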
Can I ask whether the mean of your denoised copy ratios TSV is negative? If so, the issue is in a step before ModelSegments; if not, then ModelSegments itself is being funky.
-
Thank you for your reply. I just checked the denoised copy ratio mean and it is negative.
Please suggest whether any other corrections are needed for the ModelSegments call.
-
Ok this makes more sense now. Depending on the magnitude of this difference:
* If it's small (mean log2 copy ratio > -0.1): the GATK caller (CallCopyRatioSegments) should adjust to center its calls around the sample mean, and call amplifications/deletions accordingly.
* If it's large (mean log2 copy ratio < -0.1): this either indicates that coverage is much lower for the second sample, or possibly a normalization issue involving the lack of a panel of normals. Without seeing the data firsthand I can't narrow it down further, but hopefully this gives you a starting point.
If you'd like to provide plots of the data or ask anything else about your results, I'll continue to check this thread.
-
Thanks again for your reply. When you say the magnitude of the difference, do you mean the difference in mean log2 copy ratio between the sample and its normal?
Is there a direct email you could share so that I can send you the denoised copy ratio file?
Best.
-
So sorry, I was unclear: I meant the mean of the denoised log2 copy ratios.
Let's see how you can get your data files over to me.
-
Hi, would you mind providing me with the number for the mean denoised log2 copy ratio you mentioned before?
Here's how to provide data files if you still feel you're getting unsatisfactory results: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report
-
One more question, did you enable GC correction on your sample?
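For reference, in the GATK CNV workflow explicit GC correction is usually enabled by annotating the intervals and passing the annotations to DenoiseReadCounts; a rough sketch with placeholder file names:
# annotate the preprocessed intervals with GC content
gatk AnnotateIntervals \
-R hg38.fasta \
-L targets.preprocessed.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-O annotated_intervals.tsv
# supply the annotations when denoising the read counts
gatk DenoiseReadCounts \
-I tumor.counts.hdf5 \
--annotated-intervals annotated_intervals.tsv \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv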
-
The mean log2 copy ratio is -3.080777967. I didn't turn on any GC correction while running GATK. Do you mean there are other steps we need to run for GC correction?
Thanks.
-
That's quite low. My guess is your sample has very low coverage. You can try looking at the sample and matched normal in IGV (https://igv.org/app/) to check this, looking at the relative coverage in target regions.
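If you prefer a quick command-line check instead, one option (assuming indexed BAMs and a BED file of your target regions; names are placeholders) would be something like:
samtools depth -a -b targets.bed tumor.bam normal.bam \
| awk '{ t += $3; n += $4; c++ } END { printf "tumor mean: %.2f, normal mean: %.2f\n", t/c, n/c }'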
-
Thanks. I will look into IGV installation then and reply back here.
-
I created a panel of normals with a couple of matched normals from the cohort and ran one sample against it. The hypersegmentation didn't resolve with this panel of normals either.
Thanks
-
Ok good to know.
1. Did you check the sample's depth relative to the matched normal/other samples?
2. Have you run any other samples through this pipeline, and are you getting the same results? I recommend trying to run the matched normal through ModelSegments and looking at its mean copy ratios as well as the segmentation plot (see the sketch below).
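A rough sketch of what that could look like for the normal (placeholder file names; the sequence dictionary is the .dict matching your reference):
gatk --java-options "-Xmx7g" ModelSegments \
--denoised-copy-ratios normal.denoisedCR.tsv \
--output sandbox \
--output-prefix Normal_model_segment
gatk PlotModeledSegments \
--denoised-copy-ratios normal.denoisedCR.tsv \
--segments sandbox/Normal_model_segment.modelFinal.seg \
--sequence-dictionary hg38.dict \
--output sandbox/plots \
--output-prefix Normal_model_segment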
-
1. There is not a lot of difference in read coverage between the sample and the normal.
2. I ran several other samples and they show similar hypersegmentation. This is what one of the normal plots looks like, and its mean copy ratio is zero.
Thanks
-
I see, thank you for sharing. I see two possible issues:
Sample QC is insufficient. If so:
- Q20 might not be helpful here when evaluating read depth for short reads; I'd suggest looking at the mean read depth in targeted regions. For targeted sequencing we would like it to be above 30x, preferably above 60-90x.
- Additionally, check the GC/AT dropout metrics for these samples.
The way the tool is being run is the issue. If you're getting such low copy ratio values and also oversegmentation, double-check a few things:
- Basics: the correct interval list for your targeted sequencing protocol, the correct reference, and that the alignment looks good (I recommend IGV for visualizing this).
- Check your files from each step: look at the GATK best practices pipeline I linked above (here) and double-check all your parameters against what we use for the GATK processing steps (specifically PreprocessIntervals, CollectReadCounts, DenoiseReadCounts; see the sketch after this list). Check the output files from each step: that preprocessing the intervals does what you expect, that you are getting reasonable values from CollectReadCounts, and compare the raw read counts to the denoised read counts to look for any changes.
- Modifications to your process: run CollectAllelicCounts to detect het sites and look at the minor allele fraction values (your normal should have MAF values around 0.5). I'd recommend using your allelic counts as inputs to ModelSegments.
- Run a different CNV calling tool: cnMOPS, Manta, etc. If one of those tools works for you, then it might be a better fit for your workflow.
Hopefully some of those steps will yield clues.
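For reference, a minimal sketch of those processing steps with the parameters we typically use for exome data (file names are placeholders; the interval list is the one you mentioned earlier):
# prepare padded target intervals for exome data
gatk PreprocessIntervals \
-R hg38.fasta \
-L reference_hg38_bge_exome_calling_regions.v1.1.interval_list \
--bin-length 0 \
--padding 250 \
--interval-merging-rule OVERLAPPING_ONLY \
-O targets.preprocessed.interval_list
# collect per-interval read counts
gatk CollectReadCounts \
-I tumor.bam \
-R hg38.fasta \
-L targets.preprocessed.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
--format HDF5 \
-O tumor.counts.hdf5
# denoise against a panel of normals (if available)
gatk DenoiseReadCounts \
-I tumor.counts.hdf5 \
--count-panel-of-normals cnv.pon.hdf5 \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv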
-
I ran GATK DepthOfCoverage on this sample/normal pair.

| sample_id | total | mean | granular_third_quartile | granular_median | granular_first_quartile | %_bases_above_15 |
| --- | --- | --- | --- | --- | --- | --- |
| TUMOR | 4090171663 | 13.95 | 9 | 1 | 1 | 21.0 |
| NORMAL | 5208816548 | 17.76 | 12 | 1 | 1 | 23.3 |

Thanks for all the suggestions; I will look into other CNV calling tools too.