This document describes the reference confidence model applied by HaplotypeCaller to generate a per-sample GVCF, invoked by -ERC GVCF or
As explained here, HaplotypeCaller works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. At that point, we can calculate the likelihoods of each possible genotype and emit variant calls.
What that article does not explain is how HaplotypeCaller additionally estimates the chance that some (unknown) non-reference allele is segregating at a given position, which it does by examining the realigned reads that span the reference base. At each such base we perform two calculations:
- Estimate the confidence that no SNP exists at the site by contrasting all reads with the REF base vs. all reads with any non-reference base.
- Estimate the confidence that no indel of size X (determined by a command-line parameter) could exist at the site by counting the reads that provide evidence against such an indel, and from this count estimating the chance that we would not have confidently seen the allele.
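The first calculation can be illustrated with a minimal sketch. It is not HaplotypeCaller's actual implementation: it assumes a simple diploid model in which each read base is either REF or "any non-reference base", with an error probability taken from its Phred base quality. The function name ref_vs_nonref_pls and its arguments are hypothetical, and the indel calculation is not covered.

```python
import math

def ref_vs_nonref_pls(ref_quals, nonref_quals):
    """Phred-scaled likelihoods for (hom-ref, het, hom-non-ref) at one site.

    ref_quals / nonref_quals: Phred base qualities (must be > 0) of the reads
    showing the REF base / any non-reference base. Simplified model, assumed
    for illustration only.
    """
    log10_lik = [0.0, 0.0, 0.0]
    for quals, is_ref in ((ref_quals, True), (nonref_quals, False)):
        for q in quals:
            err = 10 ** (-q / 10)          # probability the base call is wrong
            p_if_ref_allele = 1 - err if is_ref else err
            p_if_alt_allele = err if is_ref else 1 - err
            log10_lik[0] += math.log10(p_if_ref_allele)
            log10_lik[1] += math.log10(0.5 * p_if_ref_allele + 0.5 * p_if_alt_allele)
            log10_lik[2] += math.log10(p_if_alt_allele)
    best = max(log10_lik)
    # Normalize so the most likely genotype has PL = 0.
    return [round(-10 * (l - best)) for l in log10_lik]

pls = ref_vs_nonref_pls(ref_quals=[30] * 10, nonref_quals=[])
```

With ten Q30 reads all showing the REF base, the hom-ref genotype gets PL 0 and the other two genotypes get increasingly large PLs, which is the pattern expected at a confidently homozygous-reference site.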
Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the less confident of these two models. We use a symbolic ALT allele, NON_REF, to hold the likelihood that the site is not homozygous reference, as well as allele-specific PL field values.
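As a rough sketch of the last step, GQ can be derived from a set of PLs as the gap between the best and second-best genotype, and the model with the smaller GQ is the one whose values are reported. The helper names gq_from_pls and least_confident are hypothetical, the cap of 99 follows the usual GATK convention for GQ, and the PL values in the usage lines are made up for illustration.

```python
def gq_from_pls(pls, cap=99):
    """Genotype quality: difference between the two smallest PLs, capped."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)

def least_confident(snp_pls, indel_pls):
    """Return the PLs of whichever model gives the lower GQ."""
    return min((snp_pls, indel_pls), key=gq_from_pls)

# Illustrative PLs only (not real data): the indel model is less confident here.
snp_pls = [0, 30, 300]
indel_pls = [0, 12, 150]
reported = least_confident(snp_pls, indel_pls)
reported_gq = gq_from_pls(reported)
```

Reporting the weaker of the two models keeps the emitted reference confidence conservative: a site is only called confidently homozygous reference if both the SNP and the indel checks support it.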
We do this at all sites in the territory covered by the analysis, including homozygous-reference sites, both inside and outside the ActiveRegions determined by HaplotypeCaller.