Somatic CNV hypersegmentation introduced by PoN
Hello, I have used this workflow on FireCloud (github.com/gatk-workflows/gatk4-somatic-cnvs/cnv_somatic_pair_workflow & us.gcr.io/broad-gatk/gatk:4.1.0.0) on Tumor-Normal exomes with a custom PoN made of ~800 matched normals.
I noticed that while the PoN successfully cleaned a lot of samples, it appears to have introduced hypersegmentation in certain instances (see before & after CNV profiles attached).
Before:
After:
Any ideas why & how to fix it?
Thanks!
-
Official comment
Hi Romanos,
Can you clarify how the before and after plots were produced? Was the "before" plot created using an older version of this pipeline (i.e., ReCapSeg/AllelicCapSeg), and the "after" plot created using a more recent pipeline that incorporates GATK4 ModelSegments? Or were both plots created using the ModelSegments pipeline, with "before" and "after" showing the result of denoising without and with a PoN?
Since you are showing the old AllelicCapSeg-style plots, I'm guessing the situation might be the former. Could you instead show the plots generated by PlotModeledSegments, which show the copy-ratio and allele-fraction data points along with the segments? I'm guessing the hypersegmentation in chr8 might arise from oversegmentation in the allele-fraction data---perhaps this sample is unusually noisy. If so, adjusting the appropriate segmentation parameters could address this. However, it's impossible to say without seeing the data points (which is why we prefer the new method of plotting).
-
Hi Sam,
"Before" and "After" were not appropriate titles for the plots above. "Before" was produced with the old workflow used by the CGA WES pipeline and a 1000G PoN, while "After" was produced with GATK4 tools and a custom PoN.
Here are the PlotModeledSegments plots from this same sample, run on the GATK4 pipeline with either a TCGA PoN of 86 samples (Plot #1) or a custom PoN of ~700 samples (Plot #2). Both PoNs were produced with Agilent intervals, bin = 0, default padding, no blacklisted intervals, default interval merging behavior, GC correction, and the default number of components. You can see here that the allelic copy number profile is cleaner with the custom PoN, but certain chromosomes are hypersegmented compared to the TCGA_86 run. The plot is not shown here because of size restrictions, but just like the custom PoN, a TCGA_395 PoN gave a cleaner profile together with hypersegmentation (for example, in chr8).
Thanks!
Plot #1: TCGA_86
Plot #2: Custom_700
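To put a number on the hypersegmentation rather than judging it from the plots alone, one option is to count segments per contig in the segment files from the two runs. The following is only a minimal sketch: it assumes tab-separated segment files with optional SAM-style header lines starting with "@" and a column named CONTIG, and the function names are hypothetical.

```python
import csv
from collections import Counter


def read_gatk_tsv(handle):
    """Yield rows as dicts from a tab-separated file, skipping '@' header lines."""
    data_lines = (line for line in handle if not line.startswith("@"))
    yield from csv.DictReader(data_lines, delimiter="\t")


def segments_per_contig(handle, contig_column="CONTIG"):
    """Count how many segments each contig has (column name is an assumption)."""
    return Counter(row[contig_column] for row in read_gatk_tsv(handle))


def compare_segment_counts(counts_a, counts_b):
    """Return {contig: (count in run A, count in run B)} over all contigs seen."""
    contigs = sorted(set(counts_a) | set(counts_b))
    return {contig: (counts_a[contig], counts_b[contig]) for contig in contigs}
```

Opening the two segment files, passing each handle to segments_per_contig, and then calling compare_segment_counts would give a per-chromosome table in which a contig such as chr8 should stand out if it is hypersegmented in only one run.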
-
Adding more results here for troubleshooting:
I used the Custom_700 PoN with the GATK3 workflow and no hypersegmentation was observed (Plot #3, produced by ABSOLUTE from the AllelicCapSeg output).
-
Zooming way in on both Plot #1 and #2, you can see some suspicious hom sites (with alternate allele fraction = 0 or 1) in chr8. My guess is that something unusual is happening in the SNP genotyping step in your normal (perhaps due to low coverage or some other data-quality issue with that sample) that is causing homs in the normal to be mistaken for hets. Note that these same sites are used in the tumor, which may ultimately be leading to the results you see. If this is the case, perhaps you can try changing the genotyping parameters to be a little more strict. But it's hard to say exactly what is going on without looking at the results in more detail.
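As a rough way to check for such sites without zooming into the plots, something like the sketch below could list sites whose alternate-allele fraction sits at or near 0 or 1. It assumes an allelic-counts style TSV with columns CONTIG, POSITION, REF_COUNT, and ALT_COUNT and optional "@" header lines; the function names and the 0.05 margin are hypothetical choices, not GATK defaults.

```python
import csv


def read_gatk_tsv(handle):
    """Yield rows as dicts from a tab-separated file, skipping '@' header lines."""
    data_lines = (line for line in handle if not line.startswith("@"))
    yield from csv.DictReader(data_lines, delimiter="\t")


def alt_allele_fraction(ref_count, alt_count):
    """Return alt / (ref + alt), or None when the site has no coverage."""
    total = ref_count + alt_count
    return alt_count / total if total > 0 else None


def flag_hom_like_sites(handle, margin=0.05):
    """List sites whose alternate-allele fraction is within `margin` of 0 or 1.

    Column names are assumptions about the allelic-counts format; `margin`
    is an arbitrary illustrative threshold.
    """
    flagged = []
    for row in read_gatk_tsv(handle):
        ref = int(row["REF_COUNT"])
        alt = int(row["ALT_COUNT"])
        fraction = alt_allele_fraction(ref, alt)
        if fraction is None:
            continue  # no reads at this site, nothing to judge
        if fraction <= margin or fraction >= 1.0 - margin:
            flagged.append((row["CONTIG"], int(row["POSITION"]), ref, alt, fraction))
    return flagged
```

If many flagged sites cluster in the hypersegmented region, that would be consistent with homs being mistaken for hets, though it would not prove it.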
In general, visually examining the model fit to the data can help you decide whether you should be more stringent with the data (e.g., by filtering) or more flexible with the segmentation/modeling (e.g., by more aggressively smoothing). You should also take steps to perform QC on your samples to make sure the incoming data is homogeneous and that the same set of parameters will yield comparable results across the entire cohort.
Finally, if you want to frequently compare AllelicCapSeg and ModelSegments results in the future, you might consider putting together a script that allows the output of AllelicCapSeg to be consumed by PlotModeledSegments. If you study the formats of the files generated/expected by each tool, I suspect that only a relatively simple script would be required to convert between them (although you might have to dummy up some quantities that don't exactly translate; e.g., you could substitute the same AllelicCapSeg estimate for each of the posterior quantiles reported in the ModelSegments seg file). This would make it much easier to see how the data collection and modeling differ between the tools when you get discrepancies like this.
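A conversion script along those lines might look roughly like the sketch below. Everything about the file layouts here is an assumption to verify against your own files: the AllelicCapSeg column names (Chromosome, Start.bp, End.bp, n_probes, n_hets, f, tau), the convention that tau = 2 corresponds to a neutral copy ratio, and the ModelSegments-style output column names. The function names are hypothetical. The single AllelicCapSeg estimate is copied into all three posterior quantiles, as suggested above.

```python
import csv
import math

# Assumed layout of a ModelSegments-style seg file; check against a real one.
OUT_COLUMNS = [
    "CONTIG", "START", "END",
    "NUM_POINTS_COPY_RATIO", "NUM_POINTS_ALLELE_FRACTION",
    "LOG2_COPY_RATIO_POSTERIOR_10", "LOG2_COPY_RATIO_POSTERIOR_50",
    "LOG2_COPY_RATIO_POSTERIOR_90",
    "MINOR_ALLELE_FRACTION_POSTERIOR_10", "MINOR_ALLELE_FRACTION_POSTERIOR_50",
    "MINOR_ALLELE_FRACTION_POSTERIOR_90",
]


def to_float(value):
    """Parse a number, returning NaN for missing or non-numeric entries."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


def convert_row(row, tau_neutral=2.0):
    """Map one AllelicCapSeg row (assumed columns) to the assumed output layout."""
    tau = to_float(row["tau"])
    log2_cr = math.log2(tau / tau_neutral) if tau > 0 else float("nan")
    maf = to_float(row["f"])
    out = {
        "CONTIG": row["Chromosome"],
        "START": row["Start.bp"],
        "END": row["End.bp"],
        "NUM_POINTS_COPY_RATIO": row["n_probes"],
        "NUM_POINTS_ALLELE_FRACTION": row["n_hets"],
    }
    # Dummy up the quantiles: reuse the point estimate for 10/50/90.
    for quantile in ("10", "50", "90"):
        out["LOG2_COPY_RATIO_POSTERIOR_" + quantile] = log2_cr
        out["MINOR_ALLELE_FRACTION_POSTERIOR_" + quantile] = maf
    return out


def convert_allelic_capseg(in_handle, out_handle, header_lines=()):
    """Write converted segments, preceded by optional '@' header lines."""
    for line in header_lines:
        out_handle.write(line.rstrip("\n") + "\n")
    writer = csv.DictWriter(
        out_handle, fieldnames=OUT_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in csv.DictReader(in_handle, delimiter="\t"):
        writer.writerow(convert_row(row))
```

The plotting tool may also expect header lines that match the sequence dictionary, which is why header_lines is left as a parameter; copying them from a real ModelSegments output for the same reference is one way to supply them.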