How to filter the results of svgenotype
dear
For self-pollinated crops, the CN should be at intervals of 2, but there are some sites where the CN is 3, 5, 11, how should I deal with these data? Next, I want to make some associations of traits. How should I define missing, normal, and duplicate. treated all copy number genotypes ≤ 1 as losses, all copy number genotypes >3 as gains, and all the remaining genotypes as 3 unchanged. is right ?
##resullts
16 14464812 14482312 G <CNV> . . . CN 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
16 20369304 20418235 A <CNV> . . . CN 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
16 28017345 28021135 A <CNV> . . . CN 7 2 2 5 7 2 4 2 7 2 2 2 2 2 5 5 5 5 2 5 8 2
16 32957908 32960257 T <CNV> . . . CN 2 2 2 0 2 2 2 1 2 2 4 2 2 6 4 2 2 4 2 2 2 2
16 36992836 36997173 A <CNV> . . . CN 7 2 2 6 6 2 5 2 6 2 2 4 2 2 5 5 5 6 4 4 7 3
Best wishes
-
Alternatively, you could look at the CNL output to see the likelihood of the entire range of candidate copy numbers and choose the even integer with the highest likelihood. In this case, you could also re-estimate CNQ by re-normalizing CNL to even numbers only and estimate CNQ as the difference between the most-likely and second-most-likely CNL value restricted to the even copy numbers.
-
Hello
###example
20 1009500 CNV_20_1009500_1022750 T <CNV> . PASS END=1022750;GCFRACTION=0.33;GCLENGTH=13251;GLALTFREQ=NA;GLALTSUM=0.000;GLHETSUM=0.000;GLINBREEDINGCOEFF=NA;GLREFFREQ=NA;GLREFSUM=0.000;GSCALLRATE=0.968;GSCLUSTERSEP=6.44;GSCLUSTERSEPWEIGHTEDMEAN=5.50;GSCLUSTERSEPWEIGHTEDMEDIAN=5.56;GSCNALLELES=5;GSCNCATEGORY=DUP;GSCNDIST=0,0,706,7,4,4,2,1,0,0,1;GSCNMAX=10;GSCNMIN=2;GSCNQUAL=658.7038;GSDUPLICATEOVERLAP=NA;GSDUPLICATES=NA;GSDUPLICATESCORE=NA;GSELENGTH=13251;GSEXPMEAN=104.7447;GSGMMWEIGHTS=0.0001,0.0001,0.9476,0.0136,0.0115,0.0115,0.0068,0.0046,0.0014,0.0012,0.0016;GSM1=0.8850;GSM2=0.2618,1.3092;GSNNONREF=19;GSNONVARSCORE=NA;GSNVARIANT=19;SVTYPE=CNV
GT:CN:CNF:CNL:CNP:CNQ:FT:GQ:GSPC
.:3:2.9431:-1000.00,-41.56,-4.82,-0.00,-3.14,-9.47,-17.38,-26.21,-35.59,-45.36,-55.39:-1000.00,-43.56,-2.98,-0.00,-3.22,-9.54,-17.68,-26.67,-36.58,-46.43,-56.31:29.8:PASS:.:0
.:2:2.3450:-1000.00,-22.44,-0.03,-1.19,-8.24,-17.64,-28.21,-39.45,-51.11,-63.05,-75.18:-1000.00,-26.26,-0.00,-3.01,-10.13,-19.53,-30.33,-41.74,-53.91,-65.93,-77.92:30.1:PASS:.:0
.:5:4.7757:-1000.00,-180.50,-48.58,-13.11,-1.74,-0.01,-3.09,-8.93,-16.48,-25.18,-34.68:-1000.00,-182.44,-46.67,-13.04,-1.74,-0.01,-3.32,-9.32,-17.39,-26.17,-35.53:17.3:PASS:.:0
.:4:3.5883:-1000.00,-75.20,-13.73,-0.83,-0.07,-4.15,-10.66,-18.55,-27.31,-36.64,-46.37:-1000.00,-77.15,-11.83,-0.77,-0.08,-4.16,-10.90,-18.96,-28.23,-37.64,-47.23:6.9:LQ:.:0
.:8:8.1374:-1000.00,-1000.00,-203.92,-95.20,-46.28,-21.27,-8.22,-2.01,-0.06,-0.95,-3.84:-1000.00,-1000.00,-201.09,-94.21,-45.37,-20.36,-7.54,-1.49,-0.06,-1.04,-3.78:9.8:LQ:.:0
.:7:7.3261:-1000.00,-1000.00,-168.19,-73.87,-32.67,-12.72,-3.38,-0.11,-0.64,-3.69,-8.51:-1000.00,-1000.00,-165.80,-73.33,-32.20,-12.25,-3.14,-0.04,-1.08,-4.21,-8.89:10.4:LQ:.:0
.:3:3.1653:-1000.00,-40.14,-5.70,-0.01,-1.50,-5.85,-11.64,-18.24,-25.35,-32.80,-40.49:-1000.00,-42.15,-3.86,-0.01,-1.57,-5.92,-11.93,-18.71,-26.33,-33.87,-41.41:15.6:PASS:.:0
.:2:2.1481:-1000.00,-21.02,-0.00,-3.83,-13.86,-26.37,-40.11,-54.56,-69.45,-84.63,-100.02:-1000.00,-24.87,-0.00,-5.67,-15.78,-28.28,-42.25,-56.87,-72.28,-87.55,-102.79:56.7:PASS:.:0
.:2:2.1389:-263.12,-14.69,-0.00,-2.83,-10.02,-18.97,-28.78,-39.08,-49.70,-60.52,-71.49:-266.97,-18.54,-0.00,-4.67,-11.94,-20.88,-30.92,-41.40,-52.53,-63.44,-74.26:46.7:PASS:.:0
.:2:2.0001:-246.12,-12.18,-0.00,-4.20,-12.48,-22.39,-33.12,-44.31,-55.78,-67.45,-79.26:-249.97,-16.03,-0.00,-6.04,-14.40,-24.31,-35.26,-46.62,-58.61,-70.37,-82.03:60.4:PASS:.:0
.:10:10.5832:-1000.00,-1000.00,-1000.00,-189.23,-106.76,-61.26,-34.22,-17.75,-7.88,-2.40,-0.00:-1000.00,-1000.00,-1000.00,-188.31,-105.91,-60.40,-33.60,-17.30,-7.94,-2.54,-0.00:25.4:PASS:.:0###The result of re-estimating CN is as follows,is wright.
.:4
.:2
.:4
.:4
.:8
.:8
.:4
.:2
.:2
.:2
.:10first: The above is my calculation process, I find it very strange, may I ask where I am wrong?can you show the detail for me ? thank ,for SNP,L is the value of Phred-scaled,P is the Probability of Supporting Genotype ,P=10^(-L/10),GQ:-10*log(1-P),for CN: that's means to calculate twice formula(the most-likely and second-most-likely CNL value : The calculation process is the same as GQ) ,and gets two value and subtract two value,The difference between PL and CNQ is that ,one is positive and the other is a negative value
for even numbers of CN,need to re-estimate the CNQ?second:after chooseing the even integer and re-estimate CNQ, the filtering standard :CNQ >20 ,What should I do to get accurate genotyping results for GWAS?
-
There are probably multiple ways to do this, but below is my suggestion.
One thing that may be confusing you is that CNQ is calculated based on CNP (not CNL), so what I wrote above about recalculating CNQ from CNL is a little misleading. The difference is that CNP uses the overall likelihood of each cluster as a prior (which assumes you are drawing from a population in HWE). You could do the following calculation based either on CNL or on CNP. I will use CNP for consistency with the default model assumptions. You could use the calculation below to re-estimate CNL as well, and then either use CNQ based either on CNL or on CNP as is done by default.
The other assumption I am making here is that it is reasonable to apportion the likelihood of the odd copy numbers to the next highest/lowest even copy number by splitting the likelihood evenly. This is perhaps not exactly right, but probably close enough.
Here's an R function:
adjustForEvenCN <- function(cnps) {
p = 10^cnps
p = p/sum(p)
evenInds = seq(1,length(p),2)
oddInds = seq(2,length(p),2)
# Apportion the odd likelihoods to the even CNs on both sides
pnew = c(p, rep(0,length(oddInds)+1-length(evenInds)))
pnew[oddInds] = 0
pnew[oddInds-1] = pnew[oddInds-1] + p[oddInds]/2
pnew[oddInds+1] = pnew[oddInds+1] + p[oddInds]/2
best = which.max(pnew)
p2 = pnew
p2[best] = 0
best2 = which.max(p2)
bestcn = best-1
cnp = pmax(log10(pnew), rep(-1000,length(pnew)))
cnq = 10*(cnp[best]-cnp[best2])
cat(sprintf("Input CNP = %s\n", paste(cnps, collapse=",")))
cat(sprintf("pnew = %s\n", paste(pnew, collapse=",")))
cat(sprintf("CN = %d\n", bestcn))
cat(sprintf("CNQ = %1.1f\n", cnq))
cat(sprintf("CNP = %s\n", paste(sprintf("%1.2f",cnp), collapse=",")))
}And these are three of your examples (using CNP), the first, second and last (with CN 10):
cnp1 = c(-1000.00,-43.56,-2.98,-0.00,-3.22,-9.54,-17.68,-26.67,-36.58,-46.43,-56.31)
cnp2 = c(-1000.00,-26.26,-0.00,-3.01,-10.13,-19.53,-30.33,-41.74,-53.91,-65.93,-77.92)
cnp3 = c(-1000.00,-1000.00,-1000.00,-188.31,-105.91,-60.40,-33.60,-17.30,-7.94,-2.54,-0.00)
adjustForEvenCN(cnp1)
adjustForEvenCN(cnp2)
adjustForEvenCN(cnp3)Output:
Input CNP = -1000,-43.56,-2.98,0,-3.22,-9.54,-17.68,-26.67,-36.58,-46.43,-56.31
pnew = 1.37484628367215e-44,0,0.500221918240834,0,0.499778081615202,0,1.43964081367335e-10,0,1.06722046377077e-27,0,1.85461660643178e-47
CN = 2
CNQ = 0.0
CNP = -43.86,-1000.00,-0.30,-1000.00,-0.30,-1000.00,-9.84,-1000.00,-26.97,-1000.00,-46.73
Input CNP = -1000,-26.26,0,-3.01,-10.13,-19.53,-30.33,-41.74,-53.91,-65.93,-77.92
pnew = 2.74502183157847e-27,0,0.999511858345621,0,0.000488141654378985,0,1.47416400533487e-20,0,9.08962157586002e-43,0,5.86875261078268e-67
CN = 2
CNQ = 33.1
CNP = -26.56,-1000.00,-0.00,-1000.00,-3.31,-1000.00,-19.83,-1000.00,-42.04,-1000.00,-66.23
Input CNP = -1000,-1000,-1000,-188.31,-105.91,-60.4,-33.6,-17.3,-7.94,-2.54,0
pnew = 0,0,2.44185169160147e-189,0,1.98481157088062e-61,0,2.4987297242047e-18,0,0.00143788032443704,0,0.998562119675563
CN = 10
CNQ = 28.4
CNP = -1000.00,-1000.00,-188.61,-1000.00,-60.70,-1000.00,-17.60,-1000.00,-2.84,-1000.00,-0.00 -
This may be beyond what can be easily answered in this forum. Am happy to have a longer discussion if you want to contact me directly. Here are some thoughts:
1. To evaluate whether some of the less-frequent high-low copy number estimates are errors or not, you can use tools like PlotGenotypingResults and/or look at the confidence scores / copy number likelihoods. You can also evaluate whether you see excess calls in particular samples (which can be an indication of excess variability in the read depth signal) and might suggest removing that sample from your association tests.
2. If you want to filter common vs. rare variants, one approach we have used in the past is to not test for minor allele frequency < 0.01, but rather major allele frequency < 0.99. In the multi-allelic case, this is the same as lumping all minor alleles together for computing the frequency threshold.
3. One common way to do association tests is to test the association of copy number (as a continuous quantity) to phenotype, for example using a linear model. For self-pollinating plants, I think this should be very similar to testing for a dosage effect on phenotype.
-
hello
for [One common way to do association tests is to test the association of copy number (as a continuous quantity) to phenotype],i don't understand ,
Which software can test the correlation between copy number (as a continuous number) and phenotype? Most of the software is to genotype the copy number like SNP(two alleles), and then use the general linear model GLM, such as Tassel, EMMAX.
so iI don't understand that association of copy number (as a continuous quantity) to phenotype. -
I usually do this kind of analysis myself using R (no CNV-specific package, just custom R code using glm()).
I asked a couple of colleagues, who said the same. We did a quick google search and looked at some recent publications, but from a quick reading, none of them appear to implement this kind of allelic series / dosage test. Perhaps you might find such a package with a deeper search, however. If you do, feel free to post about it!
Please sign in to leave a comment.
7 comments