Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Inconsistency between intervals.vcf and Normalized Copy ratios txt file in GCNV workflow 4.2.6.1

Answered
0

12 comments

  • Official comment
    Avatar
    Andrey Smirnov

    Hi SkyWarrior,

    Computation of the copy ratios is done prior to running segmentation step in the post processing. 'Copy ratios' are a function of some posterior values and so it is normal to see signal in the 'copy ratios' be washed away once Viterbi is run.  

    It looks like the intervals weren't padded when running the pipeline. We usually set paddding=150 in the PreprocessIntervals to capture any reads that might lie outside of the capture baits. Note that we count how many read starts lie inside of the interval.

    I is definitely worth to run the PreprocessInterval steps in this case. It might not necessarily resolve the problem so playing with `cnv-coherence-length` like you have done along with `p-active`,  `interval-psi-scale`  and a few other hyperparameters would be the next step to try.

    One more thing to pay attention to is whether the interval under question was called as a common site or not. You can find that out by looking at the posterior values in the 'log_q_tau_tk.tsv'  file and comparing the two probabilities that are represented in log scale (first column is rare, second is common).

     

     

     
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Here is my Denoised copy ratio lines for the deleted locus

    chr19    45495095    45495139    1.7324975804922187
    chr19    45496792    45496825    0.1841071070156478
    chr19    45498473    45499411    1.5284935902534942

    Here is the output within intervals.vcf

    chr19    45495095    CNV_chr19_45495095_45495139    N    <DEL>,<DUP>    .    .    END=45495139    GT:CN:CNLP:CNQ    0:2:100,67,0,92,99,100:67
    chr19    45496792    CNV_chr19_45496792_45496825    N    <DEL>,<DUP>    .    .    END=45496825    GT:CN:CNLP:CNQ    0:2:41,56,0,92,99,100:41
    chr19    45498473    CNV_chr19_45498473_45499411    N    <DEL>,<DUP>    .    .    END=45499411    GT:CN:CNLP:CNQ    0:2:100,84,0,100,100,100:84

    Here is the readcounts of the same locus

    chr19    45495095    45495139    17
    chr19    45496792    45496825    1
    chr19    45498473    45499411    577

    Here is the IGV image

     

    This sample was run twice once using a known normal set and once within the same cohort of samples and both show the same problem. 

    This is an obvious show stopper for me. I will try the same analysis using an older version of gcnvkernel in 4.1.7.0 and see if the problem occurs. 

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    XHMM with a little more optimized parameters (1e-4 CNV distribution, 0.2 normalization value) captured this homozygous deletion quite nicely. I would expect simple PCA to be less precise it isn't. Default XHMM parameters could not capture this. 

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Btw 

    A few clarifications

    1- Genome : hg38 from GIAB

    2- İntervals are exons from the capture kit's bed file. 

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Same problem persists with gatk 4.1.7.0. I will try one of the earliest versions of the tool to see if it works.

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Hi Andrey Smirnov

    Padding idea is interesting. Based on what you say it would not cause nearby intervals to be merged since padding is only applied to find the reads. Although I don't think this may be the direct solution to my current problem it may help me solve the issue of samples with extreme AT GC dropout rates which causes captured fragments to be skewed away from the center of the exon. 

    I will start with padding option first and will try to modify cnv_coherence_length and other parameters. I will let you know. 

     

    EDIT: CollectReadCounts tool does not let me add padding to intervals. Manual padding is scary since it may cause closer intervals to be fused as a single interval which I don't want to. 

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Looks like that interval was considered a common site

    chr19    45495095    45495139    0.511111    -2.32E-05    -10.67037244
    chr19    45496792    45496825    0.676471    -1.64E-06    -13.31876888
    chr19    45498473    45499411    0.7082    -2.61E-07    -15.15960816
    0
    Comment actions Permalink
  • Avatar
    Andrey Smirnov

    SkyWarrior You can use PreprocessIntervals GATK tool perform interval padding for you.

    Also see the full recommended gCNV workflow here: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    I just realized after my trial that would be the way to do it. However it will certainly merge many of the short and closely packed exons into a larger interval and will reduce sensitivity when only a single exon is deleted from within so that's why I prefer not to add padding. 

    By the way counting the read starts is probably not a very good idea for many as read depth is still there even when start positions are skewed due to GC or AT dropout and there we have our many false positive and false negatives occur. Guess I have to try a different solution soon to make sure that binned read depth is included instead of collected read starts. I may use samtools or others to collect then some bash wizardry to mimic tsv format of CollectReadCounts. 

    Coming back to my question again. postprocessgermlinecnvs produce an interval.vcf first then runs viterbi on the posteriors found in the CNVcall step. Is interval.vcf produced from an viterbi function or is it generated purely from the posteriors. As the denoised copy ratios file shows correct estimation of the copy number I was wondering why there are 2 seperate results coming from the same posterior results which were before the segmentation viterbi function?

     

    0
    Comment actions Permalink
  • Avatar
    Mark Walker

    Hello SkyWarrior, the PreprocessIntervals tool won't merge intervals if they overlap after padding - the padding is only extended until they are adjacent.

    Can you elaborate a little more on your concerns about using read starts for counting? I would expect GC bias to be an issue regardless of the particular strategy used.

    Good question about the discrepancy between denoised copy ratios and posteriors. First, the intervals VCF is produced from the "caller" model posteriors (pre-Viterbi). The copy ratios are the inferred mean of the counts distribution, which does not always reflect the most likely copy state (i.e. due to higher moments of the posterior).

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Actually my concern was with those samples that have skewed read starts located with respect to the exon regions however with what you and Andrew said now it makes more sense since intervals are not merged but extended only until they touch each other then counting read starts seems more legit since it will cover a more extended region and it will be faster. 

    Some of my exomes are problematic in nature (whether it is DNA quality or something else I noticed that tip of the coverage curves don't align with exon centers.) In those samples since read starts are missed some of the GC poor chromosomes end up with less reads therefore both contig ploidy and exonic CNV calls are affected. 

    I will try with this padding approach once again and see how it works for my skewed samples and this particular sample. 

    I will let you know. 

    About the denoised.txt and intervals.vcf I understand your point since they are not %100 correlated outputs. As Andrew indicated that region is somehow marked as common but not rare so I will try to find the reason first. 

    Thanks for the help. 

    0
    Comment actions Permalink
  • Avatar
    SkyWarrior

    Hi

    Looks like padding has provided some level of consistency for the same calls from the related members. RTN2 deletion was also called I am digging more into it. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk