Inconsistency between intervals.vcf and Normalized Copy ratios txt file in GCNV workflow 4.2.6.1
AnsweredHi
I observed an inconsistency between these 2 files generated for 2 samples in 2 different runs of GCNV workflow using GATK 4.2.6.1 docker image. I am not sure if my problem is similar to this one but certainly it feels so.
For one of the samples we had a MLPA confirmed heterozygous deletion yet normalized copy ratios txt file shows 1 copy for those intervals but intervals.vcf shows homref therefore CNV call is missed in the segments. This sample was run against a verified normal set in case mode.
A second sample worked in a cohort setting and has a hom deletion with obvious breakpoints in the bam file. copy ratio txt file shows 0 copy for the deleted exon and readcount file shows only 1 read at that location. However intervals vcf shows homref again yet another missed CNV event in the segments file.
First case was rerun in a seperate cohort run and both txt file and intervals vcf showed 1 copy therefore segments vcf showed the proper CNV event called.
I am trying the second case this time using a seperate normal model to see if copy ratio is detected properly this time but I am concerned that something is not right with intervals.vcf creation and inconsistently calls known deletions as homref.
I will post more once I had the data from the second case retrial.
Can you help with this?
Thanks.
-
-
Official comment
Hi SkyWarrior,
Computation of the copy ratios is done prior to running segmentation step in the post processing. 'Copy ratios' are a function of some posterior values and so it is normal to see signal in the 'copy ratios' be washed away once Viterbi is run.
It looks like the intervals weren't padded when running the pipeline. We usually set paddding=150 in the PreprocessIntervals to capture any reads that might lie outside of the capture baits. Note that we count how many read starts lie inside of the interval.
I is definitely worth to run the PreprocessInterval steps in this case. It might not necessarily resolve the problem so playing with `cnv-coherence-length` like you have done along with `p-active`, `interval-psi-scale` and a few other hyperparameters would be the next step to try.
One more thing to pay attention to is whether the interval under question was called as a common site or not. You can find that out by looking at the posterior values in the 'log_q_tau_tk.tsv' file and comparing the two probabilities that are represented in log scale (first column is rare, second is common).
Comment actions -
Here is my Denoised copy ratio lines for the deleted locus
chr19 45495095 45495139 1.7324975804922187
chr19 45496792 45496825 0.1841071070156478
chr19 45498473 45499411 1.5284935902534942Here is the output within intervals.vcf
chr19 45495095 CNV_chr19_45495095_45495139 N <DEL>,<DUP> . . END=45495139 GT:CN:CNLP:CNQ 0:2:100,67,0,92,99,100:67
chr19 45496792 CNV_chr19_45496792_45496825 N <DEL>,<DUP> . . END=45496825 GT:CN:CNLP:CNQ 0:2:41,56,0,92,99,100:41
chr19 45498473 CNV_chr19_45498473_45499411 N <DEL>,<DUP> . . END=45499411 GT:CN:CNLP:CNQ 0:2:100,84,0,100,100,100:84Here is the readcounts of the same locus
chr19 45495095 45495139 17
chr19 45496792 45496825 1
chr19 45498473 45499411 577Here is the IGV image
This sample was run twice once using a known normal set and once within the same cohort of samples and both show the same problem.
This is an obvious show stopper for me. I will try the same analysis using an older version of gcnvkernel in 4.1.7.0 and see if the problem occurs.
-
XHMM with a little more optimized parameters (1e-4 CNV distribution, 0.2 normalization value) captured this homozygous deletion quite nicely. I would expect simple PCA to be less precise it isn't. Default XHMM parameters could not capture this.
-
Btw
A few clarifications
1- Genome : hg38 from GIAB
2- İntervals are exons from the capture kit's bed file.
-
Same problem persists with gatk 4.1.7.0. I will try one of the earliest versions of the tool to see if it works.
-
Padding idea is interesting. Based on what you say it would not cause nearby intervals to be merged since padding is only applied to find the reads. Although I don't think this may be the direct solution to my current problem it may help me solve the issue of samples with extreme AT GC dropout rates which causes captured fragments to be skewed away from the center of the exon.
I will start with padding option first and will try to modify cnv_coherence_length and other parameters. I will let you know.
EDIT: CollectReadCounts tool does not let me add padding to intervals. Manual padding is scary since it may cause closer intervals to be fused as a single interval which I don't want to.
-
Looks like that interval was considered a common site
chr19 45495095 45495139 0.511111 -2.32E-05 -10.67037244
chr19 45496792 45496825 0.676471 -1.64E-06 -13.31876888
chr19 45498473 45499411 0.7082 -2.61E-07 -15.15960816 -
SkyWarrior You can use PreprocessIntervals GATK tool perform interval padding for you.
Also see the full recommended gCNV workflow here: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl
-
I just realized after my trial that would be the way to do it. However it will certainly merge many of the short and closely packed exons into a larger interval and will reduce sensitivity when only a single exon is deleted from within so that's why I prefer not to add padding.
By the way counting the read starts is probably not a very good idea for many as read depth is still there even when start positions are skewed due to GC or AT dropout and there we have our many false positive and false negatives occur. Guess I have to try a different solution soon to make sure that binned read depth is included instead of collected read starts. I may use samtools or others to collect then some bash wizardry to mimic tsv format of CollectReadCounts.
Coming back to my question again. postprocessgermlinecnvs produce an interval.vcf first then runs viterbi on the posteriors found in the CNVcall step. Is interval.vcf produced from an viterbi function or is it generated purely from the posteriors. As the denoised copy ratios file shows correct estimation of the copy number I was wondering why there are 2 seperate results coming from the same posterior results which were before the segmentation viterbi function?
-
Hello SkyWarrior, the PreprocessIntervals tool won't merge intervals if they overlap after padding - the padding is only extended until they are adjacent.
Can you elaborate a little more on your concerns about using read starts for counting? I would expect GC bias to be an issue regardless of the particular strategy used.
Good question about the discrepancy between denoised copy ratios and posteriors. First, the intervals VCF is produced from the "caller" model posteriors (pre-Viterbi). The copy ratios are the inferred mean of the counts distribution, which does not always reflect the most likely copy state (i.e. due to higher moments of the posterior).
-
Actually my concern was with those samples that have skewed read starts located with respect to the exon regions however with what you and Andrew said now it makes more sense since intervals are not merged but extended only until they touch each other then counting read starts seems more legit since it will cover a more extended region and it will be faster.
Some of my exomes are problematic in nature (whether it is DNA quality or something else I noticed that tip of the coverage curves don't align with exon centers.) In those samples since read starts are missed some of the GC poor chromosomes end up with less reads therefore both contig ploidy and exonic CNV calls are affected.
I will try with this padding approach once again and see how it works for my skewed samples and this particular sample.
I will let you know.
About the denoised.txt and intervals.vcf I understand your point since they are not %100 correlated outputs. As Andrew indicated that region is somehow marked as common but not rare so I will try to find the reason first.
Thanks for the help.
-
Hi
Looks like padding has provided some level of consistency for the same calls from the related members. RTN2 deletion was also called I am digging more into it.
Please sign in to leave a comment.
12 comments