Hello, a while back I noticed that VCFs produced from the HaplotypeCaller-in-gVCF-mode pipeline often had heterozygous genotypes when the allele depths listed in the VCF file only supported one allele.
As it turned out, the source of this was that the GATK pipeline does local realignment of reads before calling genotypes; the genotype listed in the VCF file is the one determined after local realignment, while the allele depths appear to be the ones from before local realignment.
This explained why VCF files had genotypes that did not correspond to the listed allele depths, but it opened up a can of worms that was never really answered.
In the mean time, I've been working with these data and I am now convinced that this is a pretty serious problem. Questions:
1) Is the listing of allele depths from before local realignment a universal feature of the GATK pipeline, or is it something that has happened uniquely in my analysis, or because I mistakenly put in the wrong flag somewhere along the way?
2) If this is unique to me, how can I get GATK to emit post-local-realignment allele depths?
3) If this is a general feature of GATK, it is a straight-up bug. Ideally, VCFs should include post-local-realignment allele depths. If that is not possible for some reason, they should include no allele depths at all. The ones that are provided are worse than useless, because they are not describing the same loci as the genotypes! Anyone who wants to do post-GATK filtering like eliminating loci with ultra-high read depth (a pretty common filtering step) is unable to do so, but probably thinks that he or she can. This is a problem.
4) If this is a general feature of GATK, are downstream GATK tools being tricked by this? It is possible to do variant filtration based on "AD" and "DP", so I suspect that they are, at least in those cases.
I am working with mosquito exome data, 150bp PE illumina reads, GATK 188.8.131.52, java 1.8.0_252. I've been following the Broad best practices pretty closely, with a single round of bootstrapped base recalibration. Most of my individuals are diploid, but I have a few pools of 8-11 individuals each. The problems I've been having are in the diploid individuals (I haven't investigated the pools).
Here is an example line from the vcf file:
LG3.97 313691 . C T 10336.55 PASS AC=108;AF=0.931;AN=116;BaseQRankSum=-1.383e+00;DP=227;ExcessHet=0.0364;FS=14.356;InbreedingCoeff=0.4677;MLEAC=114;MLEAF=0.983;MQ=60.00;MQRankSum=0.00;QD=27.95;ReadPosRankSum=0.887;SOR=0.555 GT:AD:DP:GQ:PGT:PID:PL:PS
I have bolded one individual in which the problem behavior can be seen, and provided a IGV view of a .bamout to show what is happening (top part is the local realignment, bottom part is the original .bam file):
The VCF was most recently produced by:
$gatk SelectVariants --reference $loc_genome \
--variant $input_vcf \
--restrict-alleles-to BIALLELIC \
Happy to provide more of the pipeline if it would be useful.
Please sign in to leave a comment.