PL is a sample-level annotation calculated by HaplotypeCaller and GenotypeGVCFs, recorded in the sample-level columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.
This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.
Contents
- The basic math
- Example and interpretation
- Special case: non-reference confidence model (GVCF mode)
1. The basic math
The basic formula for calculating PL is:
$$ PL = -10 * \log{P(Genotype | Data)} $$
where P(Genotype | Data)
is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of P(Genotype | Data)
is described here.
Once we have that probability, we simply take the log of it and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.
The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.
2. Example and interpretation
Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).
Please note that the values chosen for this example have been simplified and may not be reflective of actual probabilities calculated by Haplotype Caller.
# Alleles Reference: A Read: T # Conditional probabilities calculated by HC
P(AA | Data) = 0.000001 P(AT | Data) = 0.000100 P(TT | Data) = 0.010000
Calculate the raw PL values
We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:
Genotype | A/A | A/T | T/T |
---|---|---|---|
Raw PL | -10 * log(0.000001) = 60 | -10 * log(0.000100) = 40 | -10 * log(0.010000) = 20 |
Our first observation here is that the genotype for which the conditional probability was the highest turns out to get the lowest PL value.
This is expected because (as described in the following link) PL is the Phred-scaled likelihood of the genotype, and is how much less likely a specified genotype is compared to the best one, as calculated using a PL formula specified at this link. This means (rather unintuitively) that it is the probability that the genotype is not correct. In other words, low PL values mean a genotype is more likely, and high PL values means it’s less likely.
Normalize
At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to normalize the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:
Genotype | A/A | A/T | T/T |
---|---|---|---|
Normalized PL | 60 - 20 = 40 | 40 - 20 = 20 | 20 - 20 = 0 |
We see that there is a direct relationship between the scaling of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs we see that the values are spaced out by a factor of 20, which is the Phred-scale equivalent of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.
Genotype quality
We actually formalize this assessment of genotype quality in the GQ annotation, as described also here.The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.
3. Special case: non-reference confidence model (GVCF mode)
When you run HaplotypeCaller with -ERC GVCF
to produce a gVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic <NON-REF>
allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).
The PL values for any possible genotype that includes the <NON-REF>
allele have to be calculated a little differently than what is explained above because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving <NON-REF>
. Instead, it uses base quality scores to model the genotype likelihoods.
More information is available in this hyperlinked article.
1 comment
Can you confirm that this definition of GQ is different than the GQ defined in the VCF spec? From the spec:
GQ (Integer): Conditional genotype quality, encoded as a phred quality −10log10 p(genotype call is wrong, conditioned on the site’s being variant).
It appears the normalization step that is performed would give a different answer than what is defined by the spec.
Please sign in to leave a comment.