Using GATK as a genotyper to genotype existing variantsAnswered
I want to use GATK only as a genotyper to genotype a set of input variants (e.g. variants from 1000 Genomes). From what I have understood from the documentation (and other papers that benchmark genotypers), this can be done by using HaplotypeCaller with the --alleles flag.
However, it seems that when using the --alleles flag, GATK will not only genotyping the input variants, but also often suggest other alleles for these variants.
For instance, if I provide this input variant:
1 879801 . G T . PASS
.. then GATK outputs the following:
1 879801 . G T,A [...] 0/2
So, basically GATK has not genotyped the G/T SNP, but suggested another SNP G/A with genotype 0/2. It seems that this only happens at variant sites that are provided in the vcf provided by --alleles.
So my question is: Is it possible to use GATK to only genotype the specific input variants? I am benchmarking genotypers, and it seems a bit "unfair" that GATK suggests other alleles than what is provided as input. This results in better genotyping accuracy when comparing against a truth dataset.
Thanks for any help and advice!
Hi Ivar Grytten,
Thanks for writing into the forum! I'll do my best to try to find a resolution that works for you. What would be the expected behavior you are suggesting in this instance? To me, it would make sense that GATK would present the call that seems to exist at this location based on the evidence.
Thanks for the reply!
I would expect GATK to genotype the G/T SNP as either 0/0, 0/1 or 1/1, not to suggest another SNP G/A that is not in the input vcf.
As another example, if I ask GATK to genotype this SNP:
1 969377 . A G
.. then in my case GATK outputs:
1 969377 . A G,ATG
with genotype 0/2.
Here it has come up with an insertion that is not in the input vcf.
I guess this may just come down to a question of the definition of what a genotyper should do. I thought a genotyper should only determine the genotypes of a given set of input variants, not call new variants (which is the job of a variant caller)? For me it's not a big deal, but it's nice to clarify whether this is the intended behaviour of GATK HaplotypeCaller with the --alleles flag :)
I see, thanks for clarifying.
Yes, the --alleles flag calls variants at those positions, not just genotypes. I will follow up with my colleagues if there is a better way to do what you are trying to do but it might take some time to get a full resolution.
Hi Ivar Grytten,
I had my colleagues take a second look and agreed that the results here are as expected. For the example in your comment above, marking the site as 0/0, 0/1 or 1/1 would be an incorrect result because the site is not A>G or A.
It looks like here we are providing information on the allele you input and giving the results that are present. Is there an alternate expected behavior you are looking for that would still provide accurate results?
Thanks a lot for clarifying!
It does indeed make sense that GATK does suggest (call) alleles that it finds support for on the given sites. I just assumed that GATK with --alleles dit not behave this way since I've seen people using GATK as a pure genotyper to genotype a given set of input variants. I guess an alternative behaviour that could make sense would be to call the genotype as ./. (no support for the variant).
Anyway, this is not a big deal, I mostly wanted to clarify what was the intended behaviour. Thanks again for helping with this! :)
Glad we could help out!
Please sign in to leave a comment.