Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Somatic CNV hypersegmentation introduced by PoN



  • Official comment
    Samuel Lee

    Hi Romanos,

    Can you clear up how the before and after plots were produced?  Was the "before" plot created using older versions of this pipeline (i.e., ReCapSeg/AllelicCapSeg), and the "after" plot created using a more recent pipeline that incorporates GATK4 ModelSegments?  Or were both plots created using the ModelSegments pipeline, with "before" and "after" showing the result of denoising with/without a PoN?

    Since you are showing the old AllelicCapSeg-style plots, I'm guessing the situation might be the former.  Could you instead show the plots generated by PlotModeledSegments, which show the copy-ratio and allele-fraction data points along with the segments?  I'm guessing the hypersegmentation in chr8 might arise from oversegmentation in the allele-fraction data---perhaps this sample is unusually noisy.  If so, adjusting the appropriate segmentation parameters could address this.  However, it's impossible to say without seeing the data points (which is why we prefer the new method of plotting).

  • Hi Sam, 

    "Before" and "After" were not appropriate titles for the plots above. "Before" was produced with the old workflow used by the CGA WES pipeline and a 1000G PoN, while "After" was produced with GATK4 tools and a custom PoN.

    Here are the PlotModeledSegments plots from this same sample, run on the GATK4 pipeline with either a TCGA PoN of 86 samples (Plot #1) or a custom PoN of ~700 samples (Plot #2). Both PoNs were produced with Agilent intervals, bin = 0, default padding, no blacklisted intervals, default interval merging behavior, GC correction, and the default number of components. You can see here that the allelic copy number profile is cleaner with the custom PoN, but certain chromosomes are hypersegmented compared to the TCGA_86 run. The plot is not shown here because of size restrictions, but just like the custom PoN, a TCGA_395 PoN gave a cleaner profile along with hypersegmentation (for example, in chr8).


    Plot #1: TCGA_86

    Plot #2: Custom_700

  • Adding more results here for troubleshooting: 

    I used the Custom_700 PoN with the GATK3 workflow and no hypersegmentation was observed (Plot #3, produced by ABSOLUTE from the AllelicCapSeg output).

  • Samuel Lee

    Zooming way in on both Plot #1 and #2, you can see some suspicious hom sites (with alternate allele fraction = 0 or 1) in chr8.  My guess is that something unusual is happening in the SNP genotyping step in your normal (perhaps due to low coverage or some other data-quality issue with that sample) that is causing homs in the normal to be mistaken for hets.  Note that these same sites are used in the tumor, which may ultimately be leading to the results you see.  If this is the case, perhaps you can try changing the genotyping parameters to be a little more strict.  But it's hard to say exactly what is going on without looking at the results in more detail.
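One way to check for the suspicious hom-like sites described above, without zooming into the plots, is to scan the het-site table directly. The following is only a minimal sketch: it assumes a tab-separated allelic-counts-style table with `CONTIG`, `POSITION`, `REF_COUNT`, and `ALT_COUNT` columns and optional header lines starting with `@`. The function name and the `min_depth` and `edge` thresholds are hypothetical and not taken from any GATK tool.

```python
import csv


def flag_hom_like_sites(lines, min_depth=10, edge=0.02):
    """Return (contig, position, alt_fraction) for sites that look homozygous.

    A site is flagged when its alternate-allele fraction is within `edge`
    of 0 or 1 and its total depth is at least `min_depth`.
    Column names and thresholds are assumptions made for illustration.
    """
    # Skip SAM-style header lines (assumed to start with '@').
    data_lines = (line for line in lines if not line.startswith("@"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    flagged = []
    for row in reader:
        ref = int(row["REF_COUNT"])
        alt = int(row["ALT_COUNT"])
        depth = ref + alt
        if depth < min_depth:
            continue  # too shallow to say anything about this site
        alt_fraction = alt / depth
        if alt_fraction <= edge or alt_fraction >= 1.0 - edge:
            flagged.append((row["CONTIG"], int(row["POSITION"]), alt_fraction))
    return flagged


# Tiny made-up table, for illustration only.
SAMPLE_LINES = [
    "@HD\tVN:1.6\n",
    "CONTIG\tPOSITION\tREF_COUNT\tALT_COUNT\n",
    "chr8\t100\t30\t0\n",
    "chr8\t200\t14\t16\n",
    "chr8\t300\t0\t25\n",
    "chr8\t400\t3\t0\n",
]

hom_like = flag_hom_like_sites(SAMPLE_LINES)
```

If many flagged sites cluster in the hypersegmented regions, that would be consistent with homs in the normal being mistaken for hets, though it would not by itself prove it.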

    In general, visually examining the model fit to the data can help you decide whether you should be more stringent with the data (e.g., by filtering) or more flexible with the segmentation/modeling (e.g., by more aggressively smoothing).  You should also take steps to perform QC on your samples to make sure the incoming data is homogeneous and that the same set of parameters will yield comparable results across the entire cohort.

    Finally, if you want to frequently compare AllelicCapSeg and ModelSegments results in the future, you might consider putting together a script that allows the output of AllelicCapSeg to be consumed by PlotModeledSegments.  If you study the formats of the files generated/expected by each tool, I suspect that only a relatively simple script would be required to convert between them (although you might have to dummy up some quantities that don't exactly translate---e.g., perhaps just substitute the same AllelicCapSeg estimate for each of the posterior quantiles reported in the ModelSegments seg file).  This would make it much easier to see how the data collection and modeling differ between the tools when you get discrepancies like this.
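The conversion idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested converter: the output column names are what a ModelSegments-style seg file is assumed to contain and should be checked against a real file, and the input keys (`contig`, `start`, `end`, `num_copy_ratio_points`, `num_het_sites`, `log2_copy_ratio`, `minor_allele_fraction`) are hypothetical stand-ins for whatever the AllelicCapSeg output provides after any needed unit conversion. The single point estimate is copied into all three posterior quantiles, as suggested above.

```python
import csv
import io

# Assumed ModelSegments-style segment columns; verify against a real seg file.
OUTPUT_COLUMNS = [
    "CONTIG", "START", "END",
    "NUM_POINTS_COPY_RATIO", "NUM_POINTS_ALLELE_FRACTION",
    "LOG2_COPY_RATIO_POSTERIOR_10",
    "LOG2_COPY_RATIO_POSTERIOR_50",
    "LOG2_COPY_RATIO_POSTERIOR_90",
    "MINOR_ALLELE_FRACTION_POSTERIOR_10",
    "MINOR_ALLELE_FRACTION_POSTERIOR_50",
    "MINOR_ALLELE_FRACTION_POSTERIOR_90",
]


def to_modeled_segment_row(seg):
    """Map one segment dict (hypothetical keys) onto the assumed output columns."""
    row = {
        "CONTIG": seg["contig"],
        "START": seg["start"],
        "END": seg["end"],
        "NUM_POINTS_COPY_RATIO": seg["num_copy_ratio_points"],
        "NUM_POINTS_ALLELE_FRACTION": seg["num_het_sites"],
    }
    # Dummy up the posterior quantiles by repeating the single point estimate.
    for quantile in ("10", "50", "90"):
        row[f"LOG2_COPY_RATIO_POSTERIOR_{quantile}"] = seg["log2_copy_ratio"]
        row[f"MINOR_ALLELE_FRACTION_POSTERIOR_{quantile}"] = seg["minor_allele_fraction"]
    return row


def write_segments(segments, handle):
    """Write converted segments as a tab-separated table with a column header."""
    writer = csv.DictWriter(
        handle, fieldnames=OUTPUT_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for seg in segments:
        writer.writerow(to_modeled_segment_row(seg))


# Made-up segment, for illustration only.
example_segments = [
    {
        "contig": "chr8", "start": 1000, "end": 2000,
        "num_copy_ratio_points": 50, "num_het_sites": 12,
        "log2_copy_ratio": 0.3, "minor_allele_fraction": 0.25,
    }
]
buffer = io.StringIO()
write_segments(example_segments, buffer)
converted_text = buffer.getvalue()
```

The sketch writes only the column header and data rows. A real input file for the plotting tool may also need header lines (for example, ones starting with `@`) copied from an existing ModelSegments output, which is not handled here.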
