Theoretical model for active region detection
Hello,
I'm interested in understanding the entire HaplotypeCaller algorithm in depth, but I cannot find an exact explanation of how an active region is detected (for example, how the active probability is computed, or how the smoothing is performed). Could you point me to a paper that explains the algorithm behind active region detection, or the mathematical model that applies to active regions?
Thank you,
Naomie.

Hi Naomie,
Here is an article on how active regions are calculated in HaplotypeCaller. Hopefully, this will answer your question.
Kind regards,
Pamela

Hi Pamela,
Thank you for your quick answer. I still have trouble understanding how the per-position score (a probability) is computed: the article says that the reference confidence model is used, but in this other post on the reference confidence model (https://gatk.broadinstitute.org/hc/en-us/articles/360035531532-HaplotypeCaller-Reference-Confidence-Model-GVCF-mode) I can't find the exact formula used to compute these probabilities. Do you use Bayes' theorem to compute the following point?
"Estimate the confidence that no indel of size X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently"
(especially the last part of the sentence), and if not, how do you compute it?
Thank you,
Naomie.

Hi Naomie Abecassis,
HaplotypeCaller does use Bayes' Theorem to calculate the genotype probability. This article includes a breakdown of the specific formulas that are used to calculate these probabilities:
Please let me know if this does not answer your question.
Kind regards,
Pamela
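
For readers following the thread, the general shape of a Bayes' theorem genotype calculation can be sketched as below. This is only the textbook form, written under assumed notation that does not come from the thread: $G$ is a candidate genotype, $D$ is the read data at the site, and the sum runs over all candidate genotypes $G_i$. The exact priors and likelihoods used by HaplotypeCaller are described in the GATK documentation, not here.

```latex
P(G \mid D) = \frac{P(G)\, P(D \mid G)}{\sum_{i} P(G_i)\, P(D \mid G_i)}
```

The denominator only normalizes the posteriors so that they sum to 1 across the candidate genotypes; the ranking of genotypes is determined by the prior times the likelihood.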

Hi Pamela,
actually I was talking about the score probability computed to define whether a region is active or not. The article that you mentioned in your first answer (ActiveRegion determination (HaplotypeCaller and Mutect2) – GATK (broadinstitute.org)) explains that:
1. for each locus an activity score is calculated and
2. then from that, the raw activity profile is smoothed to get the actual activity profile
3. and finally, comparing it to a given threshold determines whether a region is marked as active.
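
Steps 2 and 3 above can be illustrated with a minimal sketch. It is not the GATK implementation: it assumes the raw per-locus scores are already available as a list of probabilities, assumes a Gaussian smoothing kernel, and uses hypothetical function names and illustrative values for `radius`, `sigma`, and `threshold`.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized Gaussian weights for offsets -radius..+radius (assumed kernel shape)."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_profile(raw, radius=2, sigma=1.0):
    """Spread each raw per-locus score over its neighbours using the kernel."""
    kernel = gaussian_kernel(radius, sigma)
    smoothed = [0.0] * len(raw)
    for i, score in enumerate(raw):
        for k, weight in enumerate(kernel):
            j = i + k - radius  # position receiving part of locus i's score
            if 0 <= j < len(raw):
                smoothed[j] += score * weight
    return smoothed


def active_intervals(profile, threshold):
    """Return half-open (start, end) runs where the profile exceeds the threshold."""
    intervals = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(profile)))
    return intervals


if __name__ == "__main__":
    # Illustrative raw profile: a single locus with strong evidence of variation.
    raw_scores = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    profile = smooth_profile(raw_scores, radius=1, sigma=1.0)
    print(active_intervals(profile, threshold=0.1))
```

The sketch leaves open how the raw score in step 1 is obtained, which is exactly the question asked in this thread. It also omits any limits on region size or padding that a real caller would apply.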
About step 1, the article says: "per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment"
My question is: what is the formula used to calculate this probability? (i.e. what does "reference confidence model" mean in this case?)
Thank you,
Naomie.

Hi Naomie Abecassis,
I have been looking into this and trying to find some information on the algorithms behind active region determination. I was able to find this resource with a section specifically about active region determination on page 12: https://www.biorxiv.org/content/10.1101/201178v3.full.pdf
If this does not answer your question, I can ask the GATK developers if they have more information on the algorithms involved.
Kind regards,
Pamela

Thank you Pamela Bretscher