Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

ModelSegments output from Somatic CNV

0

32 comments

  • Avatar
    Gökalp Çelik

    Hi Indrani Datta

    Below is our former documentation for SomaticCNV calls. Some parts may have changed in our tooling however the main idea is still there and you will find recommendations from our developers for those parameters that you are adjusting. 

    https://gatk.broadinstitute.org/hc/en-us/articles/360035890011--How-to-part-II-Sensitively-detect-copy-ratio-alterations-and-allelic-segments 

    Regards.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thank you for your reply. I did look into this recommendations and mainly at section 8.2 but none of the plots are showing in this tutorial either in Chrome or Edge. 

    Please suggest how can I view the plots to see effect of the parameter tuning.

     

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    We are working on getting those missing images in articles currently. We are sorry for the inconvenience. 

    Here are those plots from section 8.2 

    **8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.**
    **8I. Increasing kernel variance parameters each to 0.8 gives 144 segments.** Changing `--kernel-variance-copy-ratio` alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.
     
    **8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments.** Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.
     
     
    **8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.**
    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thanks for the plots. I was able to get a decent number of segments by adjusting the kernel-variance-copy-ratio parameter.

    But after running this, in the segment file all "MEAN_LOG2_COPY_RATIO" are all negative, none of them are positive after adjusting for kernel-variance-copy-ratio. I am expecting some of the segments will be positive. Kindly suggest.

     

     

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Indrani Datta

    I have contacted our team to get a more definitive answer to your case which will will arrive soon. Before that I have a few suggestions with response to this issue. 

    Hypersegmentation could be due to the nature of your sample as well. I have seen examples of such smoothing even with germline samples due to AT/GC Dropout levels of the data coming out from the sequencer. The higher the dropout levels the worse the segmentation gets therefore trying to smoothout this kind of data may result in lowering the overall copy number states of every segment you have and below is such an example image from a germline sample that I have

    Due to the nature of the sample received from the sequencer germline CNV analysis resulted in extreme segmentation towards the negative end of the spectrum. 

    We recommend you to check overall metrics of your sequencing experiment to see if you have any outlier samples and outstanding values of such metrics.

    We will be back with more expert views shortly. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thank you so much for the detail reply and will wait for further information as you have mentioned above.

    Best

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    Hi Indrani Datta, 

    A few operational questions: 
    1. What kind of data are these? WGS? Targeted? If targeted, do you provide an interval list? What coverage are these samples (mean coverage for WGS, mean coverage over targets for targeted sequencing)?
    2. Do you run a matched normal sample as well? A panel of normals? 
    3. Did you run CollectAllelicCounts on the sample? Providing an allelic counts file to ModelSegments usually helps. 

    A few suggestions: 
    1. run PlotCopyRatios GATK tool and look at the resulting MAD metrics (specifically denoised_MAD_value_tumor), they should be small. 
    2. Take a look at our GATK best practices WDL pipeline for somatic CNV analysis here. Specifically, the ModelSegments step contains several suggested default values; check those for some starting values. You can also use this pipeline to assess whether you might benefit from some of the other pre- or post-processing steps in the pipeline, or run it yourself in Cromshell, a free tool.

    If none of this works maybe you could provide your denoisedCR.tsv to us, and we can dig into it.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thank you for your reply.

    These are whole exome data and Q20 was above 96%. And GATK somatic CNV was ran with matched normal. The modelsegment was executed without allelic counts. Hope it helps to answer your questions.

    I will look into plot Copyratios tool. And I will get back here after that.

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    Sounds good. Can you verify if you provided an interval list? I believe the tool should raise an error if it wasn't but if not it could explain the high variance.

    If you'd like to try providing allelic counts, those are calculated with the CollectAllelicCounts GATK tool. I've never run the command without allelic counts so I have no basis for comparison, but this may improve segmentation.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    I have used "reference_hg38_bge_exome_calling_regions.v1.1.interval_list" for the interval list.

    Is it possible to call "CallCopyRatioSegments" after adding CollectAllelicCounts, I thought 

    "CallCopyRatioSegments" works after coverage based call only i.e without adding allelic counts.

    Thank you.

     

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    Indira good question... the allelic counts go into ModelSegments, which uses the allelic counts to help with the segmentation. It outputs a copy ratio and an allele fraction per segment, and CallCopyRatioSegments uses only the copy ratio values to make its best guess at thresholds for amplifcations and deletions. 

    Basically the allelic counts are used for the segmentation but not the calls. 

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    I also might have missed your answer to this, but did you use a panel of normals for these data? 

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    No, I don't have panel of Normals. I used matched normal.

    And I looked into standarizedMAD.txt file, it is 0.3, does this look fine?

    Thanks.

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    It is recommended that you use both a panel of normals and a matched normal, but I'm not sure that would explain the anomalies. It could explain the negative copy ratios though: is your mean target coverage lower for the tumor than for the matched normal?

     

    Edit: that MAD value is on the higher end but in the range of values I've seen on clinical data, particularly for saliva samples.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Q20 is slightly higher in matched normal than tumor. This is not a clinical sample though.

    Thanks

     

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    I'm looking into whether that can shift the baseline. 

    Can I ask you if the mean of your denoised copy ratios TSV is negative? If so the issue is a step before ModelSegments; if not then ModelSegments itself is being funky.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thank you for your reply. I just checked the denoised copy ratio mean and it is negative.

    Please suggest if any other corrections needed for model segment call.

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    Ok this makes more sense now. Depending on the magnitude of this difference: 

    * If it's small (mean log2 copy ratio > -0.1): the GATK caller (CallCopyRatioSegments) should adjust to center its calls around the sample mean, and call amplifications/deletions accordingly.

    * If it's large ( < 0.1): this either indicates that coverage is way lower for the second sample, or possibly some normalization issue involving the lack of a panel of normals. Without seeing the data firsthand I can't narrow it down further, but hopefully this gives you a starting point.

    If you'd like to provide plots of the data or ask anything else about your results, I'll continue to check this thread.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thanks for your reply again. When you say the magnitude of the difference ,do you mean the difference between mean log2 copy ratio between sample and its normal.

    Is there any direct email you could share then I can share the denoise copy ratio file with you.

    Best.

     

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    So sorry I was unclear, I meant the mean of the denoised log2 copy ratios.

    Let's see how you can get your data files over to me.

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    Hi, would you mind providing me with the number for the mean denoised log2 copy ratio you metioned before? 

    Here's how to provide data files if you still feel you're getting unsatisfactory results: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    One more question, did you enable GC correction on your sample?

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    The mean log2 copy ratio is -3.080777967.  I didn't turn on any GC correction while running

    GATK, do you mean any other steps we need to do GC correction. 

    Thanks.

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    That's quite low. My guess is your sample has very low coverage. You can try looking at the sample and matched normal in IGV (https://igv.org/app/) to check this, looking at the relative coverage in target regions. 

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    Thanks. I will look into IGV installation then and reply back here.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    I created a panel of Normals with couple of matched Normals from the cohort and ran one sample with this panel of Normals. The hyper segmentation didn't resolve with this panel of Normals either.

    Thanks

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    Ok good to know. 

    1. Did you check the sample's depth relative to the matched normal/other samples?

    2. Have you run any other samples through this pipeline, and are you getting the same results? I recommend you try running the matched normal and looking at the mean copy ratios as well as the segmentation plot coming from ModelSegments.

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

     

    1. There is not a lot difference in read coverage between sample and normal.

    2. I ran several other samples and they have similar hyper segmentation. This is how one of the normal plot looks like and mean copy ratio is zero.

    Thanks

    0
    Comment actions Permalink
  • Avatar
    Tamar Melman

    I see, thank you for sharing. I see two possible issues: 

    Sample QC is insufficient. If so:

    • Q20 might not be helpful here when evaluating read depth for short reads; I'd suggest looking at mean read depth in targeted regions. For targeted sequencing we would like it to be above 30, preferably above 60-90x. 
    • Additionally, check GC/AT dropout metrics for these samples.

    The way the tool is being deployed is insufficient. If you're getting such low copy ratio values and also oversegmentation, double check a few things:

    1. basics: the correct interval list for your targeted sequencing protocol, the correct reference, and the alignment looks good (recommend IGV for visualizing this).
    2. check your files from each step: look at the GATK best practices pipeline I linked to above (here) and double check all your parameters against what we use to run all the GATK processing steps (specifically PreprocessIntervals, CollectReadCounts, DenoiseReadCounts). Check the output files for each step: that preprocessing the intervals does what you expect, that you are getting reasonable values for the output of CollectReadCounts, and compare raw read counts to the denoised readcounts file to look for any changes.
    3. modifications to your process: run CollectAllelicCounts to detect het sites and look at minor allele fraction values (your normal should have MAF values at 0.5). I'd recommend using your allelic counts as inputs to ModelSegments. And
    4. run a different CNV calling tool: cnMOPS, Manta, etc. If one of those tools works for you, then that might be a better fit for your workflow. 

    Hopefully any of those steps will yield some clues. 

    0
    Comment actions Permalink
  • Avatar
    Indrani Datta

    I did ran GATK depthofcoverage on these sample/normal.

    sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_15
    TUMOR,4090171663,13.95,9,1,1,21.0

    sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_15
    NORMAL,5208816548,17.76,12,1,1,23.3

    Thanks for all suggestions and I will look into other cnv calling tools too.

     

     

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk