Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

How to filter the results of svgenotype



  • 0
    Comment actions Permalink
  • Avatar
    Bob Handsaker

    Alternatively, you could look at the CNL output to see the likelihood of the entire range of candidate copy numbers and choose the even integer with the highest likelihood.  In this case, you could also re-estimate CNQ by re-normalizing CNL to even numbers only and estimate CNQ as the difference between the most-likely and second-most-likely CNL value restricted to the even copy numbers.

    Comment actions Permalink
  • Avatar




    ###The result of re-estimating CN is as follows,is wright.

    first: The above is my calculation process, I find it very strange, may I ask where I am wrong?can you show the detail for me ? thank ,for SNP,L is the value of Phred-scaled,P is the Probability of Supporting Genotype ,P=10^(-L/10),GQ:-10*log(1-P),for CN: that's means to calculate twice formula(the most-likely and second-most-likely CNL value : The calculation process is the same as GQ) ,and gets two value and subtract two value,The difference between PL and CNQ is that ,one is positive and the other is a negative value
    for even numbers of CN,need to re-estimate the CNQ?

    second:after chooseing the even integer and re-estimate CNQ, the filtering standard :CNQ >20 ,What should I do to get accurate genotyping results for GWAS?


    Comment actions Permalink
  • Avatar
    Bob Handsaker

    There are probably multiple ways to do this, but below is my suggestion.

    One thing that may be confusing you is that CNQ is calculated based on CNP (not CNL), so what I wrote above about recalculating CNQ from CNL is a little misleading. The difference is that CNP uses the overall likelihood of each cluster as a prior (which assumes you are drawing from a population in HWE). You could do the following calculation based either on CNL or on CNP. I will use CNP for consistency with the default model assumptions. You could use the calculation below to re-estimate CNL as well, and then either use CNQ based either on CNL or on CNP as is done by default.

    The other assumption I am making here is that it is reasonable to apportion the likelihood of the odd copy numbers to the next highest/lowest even copy number by splitting the likelihood evenly. This is perhaps not exactly right, but probably close enough.

    Here's an R function:

    adjustForEvenCN <- function(cnps) {
    p = 10^cnps
    p = p/sum(p)
    evenInds = seq(1,length(p),2)
    oddInds = seq(2,length(p),2)
    # Apportion the odd likelihoods to the even CNs on both sides
    pnew = c(p, rep(0,length(oddInds)+1-length(evenInds)))
    pnew[oddInds] = 0
    pnew[oddInds-1] = pnew[oddInds-1] + p[oddInds]/2
    pnew[oddInds+1] = pnew[oddInds+1] + p[oddInds]/2
    best = which.max(pnew)
    p2 = pnew
    p2[best] = 0
    best2 = which.max(p2)
    bestcn = best-1
    cnp = pmax(log10(pnew), rep(-1000,length(pnew)))
    cnq = 10*(cnp[best]-cnp[best2])
    cat(sprintf("Input CNP = %s\n", paste(cnps, collapse=",")))
    cat(sprintf("pnew = %s\n", paste(pnew, collapse=",")))
    cat(sprintf("CN = %d\n", bestcn))
    cat(sprintf("CNQ = %1.1f\n", cnq))
    cat(sprintf("CNP = %s\n", paste(sprintf("%1.2f",cnp), collapse=",")))

    And these are three of your examples (using CNP), the first, second and last (with CN 10):

    cnp1 = c(-1000.00,-43.56,-2.98,-0.00,-3.22,-9.54,-17.68,-26.67,-36.58,-46.43,-56.31)
    cnp2 = c(-1000.00,-26.26,-0.00,-3.01,-10.13,-19.53,-30.33,-41.74,-53.91,-65.93,-77.92)
    cnp3 = c(-1000.00,-1000.00,-1000.00,-188.31,-105.91,-60.40,-33.60,-17.30,-7.94,-2.54,-0.00)


    Input CNP = -1000,-43.56,-2.98,0,-3.22,-9.54,-17.68,-26.67,-36.58,-46.43,-56.31
    pnew = 1.37484628367215e-44,0,0.500221918240834,0,0.499778081615202,0,1.43964081367335e-10,0,1.06722046377077e-27,0,1.85461660643178e-47
    CN = 2
    CNQ = 0.0
    CNP = -43.86,-1000.00,-0.30,-1000.00,-0.30,-1000.00,-9.84,-1000.00,-26.97,-1000.00,-46.73
    Input CNP = -1000,-26.26,0,-3.01,-10.13,-19.53,-30.33,-41.74,-53.91,-65.93,-77.92
    pnew = 2.74502183157847e-27,0,0.999511858345621,0,0.000488141654378985,0,1.47416400533487e-20,0,9.08962157586002e-43,0,5.86875261078268e-67
    CN = 2
    CNQ = 33.1
    CNP = -26.56,-1000.00,-0.00,-1000.00,-3.31,-1000.00,-19.83,-1000.00,-42.04,-1000.00,-66.23
    Input CNP = -1000,-1000,-1000,-188.31,-105.91,-60.4,-33.6,-17.3,-7.94,-2.54,0
    pnew = 0,0,2.44185169160147e-189,0,1.98481157088062e-61,0,2.4987297242047e-18,0,0.00143788032443704,0,0.998562119675563
    CN = 10
    CNQ = 28.4
    CNP = -1000.00,-1000.00,-188.61,-1000.00,-60.70,-1000.00,-17.60,-1000.00,-2.84,-1000.00,-0.00

    Comment actions Permalink
  • Avatar
    Bob Handsaker

    This may be beyond what can be easily answered in this forum. Am happy to have a longer discussion if you want to contact me directly. Here are some thoughts:

    1. To evaluate whether some of the less-frequent high-low copy number estimates are errors or not, you can use tools like PlotGenotypingResults and/or look at the confidence scores / copy number likelihoods. You can also evaluate whether you see excess calls in particular samples (which can be an indication of excess variability in the read depth signal) and might suggest removing that sample from your association tests.

    2. If you want to filter common vs. rare variants, one approach we have used in the past is to not test for minor allele frequency < 0.01, but rather major allele frequency < 0.99. In the multi-allelic case, this is the same as lumping all minor alleles together for computing the frequency threshold.

    3. One common way to do association tests is to test the association of copy number (as a continuous quantity) to phenotype, for example using a linear model. For self-pollinating plants, I think this should be very similar to testing for a dosage effect on phenotype.


    Comment actions Permalink
  • Avatar

          for [One common way to do association tests is to test the association of copy number (as a continuous quantity) to phenotype],i don't understand ,
          Which software can test the correlation between copy number (as a continuous number) and phenotype? Most of the software is to genotype the copy number like SNP(two alleles), and then use the general linear model GLM, such as Tassel, EMMAX.
          so iI don't understand that association of copy number (as a continuous quantity) to phenotype.

    Comment actions Permalink
  • Avatar
    Bob Handsaker

    I usually do this kind of analysis myself using R (no CNV-specific package, just custom R code using glm()).

    I asked a couple of colleagues, who said the same. We did a quick google search and looked at some recent publications, but from a quick reading, none of them appear to implement this kind of allelic series / dosage test. Perhaps you might find such a package with a deeper search, however. If you do, feel free to post about it!



    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk