Artificial haplotype generates a new variant for a homozygous site
Hi,
I am using the following command to call variants:
gatk HaplotypeCaller \
-R dmel-all-chromosome-r5.44.fasta \
-I ../121002_I637_FCD1B4GACXX_L1_SZAXPI015086-62.sort.noDup.realign.bam \
-L 3R:10100000-10400000 \
--dont-use-soft-clipped-bases true \
--bamout test_62_3R10209297.bam \
-O test_62_3R10209297_NOsoftclip.g.vcf
For a position 3R:10209297, HaplotypeCaller reports it as a heterozygous as :
GT:AD:DP:GQ:PL 0/1:17,39:56:99:1586,0,568
3R 10209297 . G T 1578.64 . AC=1;AF=0.500;AN=2;DP=63;ExcessHet=3.0103;FS=4.152;MLEAC=1;MLEAF=0.500;MQ=59.39;MQRankSum=-0.621;QD=28.19;SOR=1.179 GT:AD:DP:GQ:PL 0/1:17,39:56:99:1586,0,568
But when I checked the input bam file in IGV, I find it's an alternative homozygous site (T/T), no read supports the reference (upper track in the following figure).
[upper track: input bam file; bottom track: bamout bam file and grouped/colored by "HC" ]
However, when I checked the --bamout output file (the bottom track in the above figure), it turns out to be a heterozygous site with some reads supporting G allele. And all the reads containing G at this position are from HC tagged artificial haplotype! Clearly, the artificial haplotype changes the position from homoAlt -> Hetero by introducing a G allele at this position.
The genotype, as indicated in the gvcf file, is a a high quality call (high AD and GQ), it's unlikely to be filtered out.
But I think it's more likely to be a wrong genotype! Because the individual's father:
1/1:0,24:24:72:.:.:1080,72,0:.
The individual's mother:
1/1:0,27:27:81:.:.:1215,81,0:.
Both parents are homoAlt, with ZERO reads having a reference allele. Therefore, I believe their child's genotype should also be 1/1, rather than 0/1. (Except a new mutation, which is very very rare.)
It seems the artificial haplotype causes the problem. But I have no idea how could it introduce a new allele, and I am wondering if it's possible to disable artificial haplotype.
Thanks!
-
Hi Yiguan Wang, this has been discussed on the forum previously, please see this post: https://gatk.broadinstitute.org/hc/en-us/community/posts/360058309891-how-to-get-rid-of-haplotypecaller-variants-caused-by-weird-artificial-haplotypes
You might also want to check out the Genotype Refinement workflow for germline short variants.
Best,
Genevieve
-
Hi Genevieve, thanks a lot for your reply.
I've read the post you mentioned, but I don't think it answers my confusion. I did some post-HaplotypeCaller filtering, but I am afraid it may not remove these variants. As I showed in the example, "3R:10209297 GT:AD:DP:GQ:PL 0/1:17,39:56:99:1586,0,568" is a high quality variant.
I also followed the Refinement workflow to do "CalculateGenotypePosteriors" and "VariantAnnotator" providing a pedigree file which includes Father+Mother + 12 Children. I find many "hiConfDeNovo" and "loConfDeNovo" in these samples, but for the "sample62_3R:10209297", nothing changed - neither identified as de novo (1/1 1/1 -> 0/1) nor refined as non-variant (1/1).
3R 10209297 . G T 31555.30 . AC=27;AF=0.964;AN=28;BaseQRankSum=1.76;DP=867;ExcessHet=3.0103;FS=0;InbreedingCoeff=-0.037;MLEAC=27;MLEAF=0.964;MQ=59.74;MQRankSum=0;PG=26,11,0;QD=27.37;ReadPosRankSum=-1.314;SOR=0.871 GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP:PS 1/1:0,24:24:23:71:11:.:.:1080,72,0:1046,23,0 1/1:0,27:27:32:71:11:.:.:1215,81,0:1181,32,0 0|1:16,39:55:99:71:11:0|1:10209297_G_T:1589,0,554:1663,0,483:10209297 1/1:0,73:73:99:.:.:.:.:3285,220,0:3311,231,0 1/1:0,39:39:99:.:.:.:.:1749,117,0:1775,128,0 1/1:0,54:54:99:.:.:.:.:2424,163,0:2450,174,0 1|1:0,33:33:99:.:.:1|1:10209297_G_T:1485,99,0:1511,110,0:10209297 1/1:1,30:33:66:.:.:.:.:1306,55,0:1332,66,0 1|1:0,31:34:99:.:.:1|1:10209297_G_T:1395,93,0:1421,104,0:10209297 1/1:0,49:49:99:.:.:.:.:2199,148,0:2225,159,0 1/1:0,85:85:99:.:.:.:.:3820,256,0:3846,267,0 1/1:0,77:77:99:.:.:.:.:3465,232,0:3491,243,0 1/1:0,75:75:99:.:.:.:.:3375,226,0:3401,237,0 1/1:0,68:68:99:.:.:.:.:3060,205,0:3086,216,0
[sample order: father - mother - sample62 ... other 11 samples]
My major concern is the artificial haplotypes and their strange behaviours, how could "HaplotypeCaller" create these haplotypes with a new allele which looks like a de novo mutation.
-
Hello Yiguan Wang,
Here are a few options to run that can potentially help this issue or give insight into why this issue is coming up:
- Run with the --linked-debrujin-graph flag which will run a linked de brujin graph in the reassembly step.
- Restrict the reads to only ones that overlap the variant with the flag --allele-informative-reads-overlap-margin.
We want to determine if this is caused by phasing or the realignment by HaplotypeCaller, and those steps above can help narrow things down.
A couple questions: What version of HaplotypeCaller are you running? And could you send another screenshot, with the bamout organized by read groups and the base context at that site?
Thank you,
Genevieve
Please sign in to leave a comment.
3 comments