The problem: My DP and AD values don't match up!
Here's a common problem — You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!
For example, sometimes a VCF may contain a variant call that looks like this:
2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38
You can see in the Format field the AD values are 0 for both of the alleles. However, in the Info and FORMAT fields, the DP is 10. Because the DP in the INFO field is unfiltered and the DP in the FORMAT field is filtered, you know none of the reads were filtered out by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?
The explanation: uninformative reads
This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.
We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred scaled likelihoods must be greater than 0.2 to be considered significant. In other words, that means the most likely allele must be 60% more likely than the second most likely allele.
Before continuing, let's take a moment to understand the types of AD and DP, and what they mean.
"INFO" field DP
- this includes all reads that overlap a variant after local realignment, which pass the quality filters."FORMAT" field DP
- this includes only informative reads for that sample."FORMAT" field AD
- this includes only informative reads for each allele. The sum of all AD = DP (unless the alleles exceed the max alt alleles limit).
Now, let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.
Reads | Likelihood of A | Likelihood of T |
---|---|---|
1 | 3.8708e-7 | 3.6711e-7 |
2 | 4.9992e-7 | 2.8425e-7 |
Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation).
Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the log of the likelihoods.
Reads | Phred-scaled likelihood of A | Phred-scaled likelihood of T |
---|---|---|
1 | -6.4122 | -6.4352 |
2 | -6.3011 | -6.5463 |
Now, we want to determine if read 1 is informative. To do this, we simply look at the Phred scaled likelihoods of the most likely allele and the second most likely allele. The Phred scaled likelihood of the most likely allele (A) is -6.4122.The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is Less than 0.2, read 1 is considered uninformative.
To determine if read 2 is informative, we take -6.3011-(-6.5463). This gives us 0.2452, which is greater than 0.2. Read 2 is considered informative.
How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Well, because the likelihoods are Phred-scaled, 0.2 = 10^0.2 = 1.585 which is approximately 60% greater.
Informative reads, uninformative reads, DP, and AD
So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.
Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.
In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values 0. The uncertainty at the site will be reflected in the QG and PL values.
5 comments
Thanks, this is a well explained article about a much needed missing information. I wonder if is there any parameter to change the 'difference between the Phred scaled likelihoods' from .2 to the desired number?
It'll be helpful to report read coverage after filtering for each sample in the output of GenotypeGVCFs
Thank you for this explanation. This clarifies some questions. However, I continue to be confused.
The "The explanation: uninformative reads" section of the article suggests that there should sometimes be two different DP values.
1. DP-info: how many reads cover the position.
2. DP-format: how many reads are informative to distinguish two different alleles.
Then, the second type of reads should also be reported in AD, and there sum(AD)=DP-format.
All this sounds reasonable. But the vcf line example does not agree with this explanation. There, both DP values are 10, and sum(AD)=0.
Is there a difference between DP-info and DP-format?
Hello,
Based on the information, following variant is homozygous based on FORMAT DP or INFO DP?
When i check on IGV the variant looks heterozygous with following read counts:
chr1:228346358;total count 150;ref G count 150;ins count 1.
chr1:228346359;total count 157;ref C count 157;ins count 80.
chr1 228346358 . G GCCGCCGCGGCCCCCCGGCCT 3040.03 . AC=2;AF=1.00;AN=2;DP=149;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.51;SOR=2.105 GT:AD:DP:GQ:PL 1/1:0,72:72:99:3054,203,0
So how to be sure the variant is hom alt or het?
Thank you
Hello,
I have this problem that I have encountered in regards to HaplotypeCaller's -bamout produced realigned BAM and AD according to VCF. The VCF has been done with 63 samples through the HaplotypeCaller -> CombineGVCF -> GenotypeGVCF pipeline. I have produced IGV pictures from these HaplotypeCaller created BAMs. Below is all samples' FORMAT field in the order they are in the IGV picture (there are many sites like this in other IGV pictures also):
0|1:20,12:32:99:0|1:12504_A_G:444,0,804:12504
0/0:48,0:48:99:.:.:0,109,1726
0/0:51,0:51:99:.:.:0,120,1800
The position in question in the VCF and IGV pictures is Contig31:12504. So the last sample is of interest as it shows AD of 51,0 and GQ of 99 and despite that it doesn't show the reads. Have you any idea why? -bamout option in HC won't produce uninformative reads but these shouldn't be uninformative, right?
Please sign in to leave a comment.