Read counts in vcf files do not reflect local realignments
Hello, a while back I noticed that VCFs produced from the HaplotypeCaller-in-gVCF-mode pipeline often had heterozygous genotypes when the allele depths listed in the VCF file only supported one allele.
As it turned out, the source of this was that the GATK pipeline does local realignment of reads before calling genotypes; the genotype listed in the VCF file is the one determined after local realignment, while the allele depths appear to be the ones from before local realignment.
This explained why VCF files had genotypes that did not correspond to the listed allele depths, but it opened up a can of worms that was never really answered.
In the mean time, I've been working with these data and I am now convinced that this is a pretty serious problem. Questions:
1) Is the listing of allele depths from before local realignment a universal feature of the GATK pipeline, or is it something that has happened uniquely in my analysis, or because I mistakenly put in the wrong flag somewhere along the way?
2) If this is unique to me, how can I get GATK to emit post-local-realignment allele depths?
3) If this is a general feature of GATK, it is a straight-up bug. Ideally, VCFs should include post-local-realignment allele depths. If that is not possible for some reason, they should include no allele depths at all. The ones that are provided are worse than useless, because they are not describing the same loci as the genotypes! Anyone who wants to do post-GATK filtering like eliminating loci with ultra-high read depth (a pretty common filtering step) is unable to do so, but probably thinks that he or she can. This is a problem.
4) If this is a general feature of GATK, are downstream GATK tools being tricked by this? It is possible to do variant filtration based on "AD" and "DP", so I suspect that they are, at least in those cases.
My specifics:
I am working with mosquito exome data, 150bp PE illumina reads, GATK 4.1.4.0, java 1.8.0_252. I've been following the Broad best practices pretty closely, with a single round of bootstrapped base recalibration. Most of my individuals are diploid, but I have a few pools of 8-11 individuals each. The problems I've been having are in the diploid individuals (I haven't investigated the pools).
Here is an example line from the vcf file:
LG3.97 313691 . C T 10336.55 PASS AC=108;AF=0.931;AN=116;BaseQRankSum=-1.383e+00;DP=227;ExcessHet=0.0364;FS=14.356;InbreedingCoeff=0.4677;MLEAC=114;MLEAF=0.983;MQ=60.00;MQRankSum=0.00;QD=27.95;ReadPosRankSum=0.887;SOR=0.555 GT:AD:DP:GQ:PGT:PID:PL:PS
1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,30:30:6:.:.:1395,416,323,268,229,199,175,154,136,120,106,93,82,71,61,52,43,35,27,20,13,6,0
1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,33:33:9:.:.:1485,397,298,240,199,167,141,118,99,82,67,54,41,30,19,9,0
0/0/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:2,27:29:0:.:.:1167,331,250,203,170,144,123,106,91,77,66,55,46,37,30,23,17,11,6,2,0,0,58
0/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:1,47:48:6:.:.:2040,626,485,402,344,299,262,230,203,180,159,139,122,106,91,78,65,53,42,32,23,14,6,0,23
1|1:0,2:2:6:1|1:313691_C_T:90,6,0:313691
1|1:0,2:2:6:1|1:313691_C_T:90,6,0:313691
./.:0,0:0:.:.:.:0,0,0
1|1:0,2:2:6:1|1:313691_C_T:90,6,0:313691
1|1:0,3:3:9:1|1:313691_C_T:135,9,0:313691
1|1:0,3:3:9:1|1:313691_C_T:135,9,0:313691
./.:0,0:0:.:.:.:0,0,0
./.:0,0:0:.:.:.:0,0,0
./.:0,0:0:.:.:.:0,0,0
0/0:10,0:10:3:.:.:0,3,215
1|1:0,12:12:36:1|1:313691_C_T:540,36,0:313691
1|1:0,4:4:12:1|1:313691_C_T:180,12,0:313691
1|1:0,9:9:27:1|1:313691_C_T:405,27,0:313691
1|1:0,7:7:21:1|1:313691_C_T:315,21,0:313691
1|1:0,7:7:24:1|1:313691_C_T:360,24,0:313691
1|1:0,8:8:24:1|1:313691_C_T:360,24,0:313691
0|1:1,7:8:21:0|1:313691_C_T:291,0,21:313691
1|1:0,5:5:15:1|1:313691_C_T:225,15,0:313691
1|1:0,2:2:6:1|1:313691_C_T:90,6,0:313691
0/0:3,0:3:9:.:.:0,9,79
I have bolded one individual in which the problem behavior can be seen, and provided a IGV view of a .bamout to show what is happening (top part is the local realignment, bottom part is the original .bam file):
The VCF was most recently produced by:
$gatk SelectVariants --reference $loc_genome \
--variant $input_vcf \
--restrict-alleles-to BIALLELIC \
--output $output.vcf
Happy to provide more of the pipeline if it would be useful.
-
-
Hi jhb
Sorry for the delay, but here are responses to your questions:
1) Is the listing of allele depths from before local realignment a universal feature of the GATK pipeline, or is it something that has happened uniquely in my analysis, or because I mistakenly put in the wrong flag somewhere along the way?
The allele depths are given post local realignment and the math is explained here (the article I already shared above): https://gatk.broadinstitute.org/hc/en-us/articles/360035532252-Allele-Depth-AD-is-lower-than-expected
In the vcf, if a read is considered uninformative it is counted towards the DP, but not the AD. Though uninformative reads are not reported in the AD, it is still used in calculations for genotyping. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record.
2) If this is unique to me, how can I get GATK to emit post-local-realignment allele depths?
N/A
3) Ideally, VCFs should include post-local-realignment allele depths. If that is not possible for some reason, they should include no allele depths at all. The ones that are provided are worse than useless, because they are not describing the same loci as the genotypes! Anyone who wants to do post-GATK filtering like eliminating loci with ultra-high read depth (a pretty common filtering step) is unable to do so, but probably thinks that he or she can. This is a problem.As mentioned above all the post local realignment reads are used for calculating genotypes, so they are in fact describing the same loci as the genotypes. Ultra-high read depth is not dependent on alleles, so the site-level DP could also be used, which does not depend on any read likelihoods.
4) If this is a general feature of GATK, are downstream GATK tools being tricked by this? It is possible to do variant filtration based on “AD” and “DP”, so I suspect that they are, at least in those cases.The gnomAD team has successfully created a high-quality callset by filtering on AD and DP as follows: high-quality variants have depth >= 10, genotype quality >= 20 and minor allele balance > 0.2 for heterozygous genotypes -- see "AC_adj" in https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/
Please sign in to leave a comment.
2 comments