Bug in GATK4 variant calling followed by CalculateGenotypePosteriors
AnsweredHi, I may have found a bug in GATK4 HaplotypeCaller-> CalculateGenotypePosteriors that is emitting the wrong genotypes.
1) GATK 4.1.4.0
2) Command:
gatk HaplotypeCaller
-R ${ref_fasta} \
-I father.bam -I mother.bam -I child.bam \
-O ${output_vcf} \
-G StandardAnnotation -G StandardHCAnnotation \
-new-qual \
--dont-use-soft-clipped-bases true
This was followed by:
gatk CalculateGenotypePosteriors \
-V ~{output_vcf} \
-ped ~{pedigree} \
-supporting ~{gnomad_vcf} \
-O ~{output_cgp_vcf}
3) This emitted this variant call:
chr12 49051248 . TC T 31.42 PASS AC=2;AF=0.333;AN=6;BaseQRankSum=-0.204;DP=18;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.167;MQ=59.35;MQRankSum=0.612;PG=0,0,0;QD=15.71;ReadPosRankSum=-1.762;SOR=0.693 GT:AD:DP:FT:GQ:JL:JP:PL:PP 0/0:9,0:9:lowGQ:17:0:14:0,27,374:0,17,364 0/1:3,0:3:lowGQ:18:0:14:0,9,119:18,0,110 0/1:0,2:2:PASS:31:0:14:40,0,16:31,0,43
4) Note that I used this gnomad_vcf in the CalculateGenotypePosteriors step: gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
This variant call in the second sample (mother) doesn't make sense. It was called as 0/1, but there were 3 REF reads and 0 ALT reads. Furthermore, PL = 0,9,119, which should be a 0/0 genotype, but then PP = 18,0,110. Why did the most likely genotype shift from WT to het-ALT when there were 3 REF reads and 0 ALT reads? I suspect a bug in CalculateGenotypePosteriors.
Moreover, this variant is not in af-only-gnomad.hg38.vcf.gz, so there is no reason that the population AF reference would have affected genotype posteriors for this variant.
Thanks.
-
Hi GE
My guess is the issue is arising from the coverage of that region being so low. Here are a couple of things to test.
- Try to run Haplotypecaller individually on the second sample and see if the it assigns the same genotype at that location.
- Can you please share a IGV screen shot of the bam and bamout of the variant in the second sample (mother). I wonder if the local-assembly that Haplotypecaller performs assigns this variant as a het.
-
1. Haplotypecaller on the second sample individually does not call this variant. The variant genotype was changed by CalculateGenotypePosteriors. You can see by the AD tag that HaplotypeCaller did not detect any reads supporting the variant, yet for some reason CalculateGenotypePosteriors called it as heterozygous. This should not be possible, and almost definitely indicates a bug in CalculateGenotypePosteriors.
2. Attached is the original input BAM screenshot of the 3 samples in order, and you can see there are no reads at all for this 1 bp deletion in the second sample (I scrolled down each sample). Note it is a very messy region probably because there are a handful of polyG sequences that interfere with Illumina sequencing. I do not have the bamout, but as mentioned above, HaplotypeCaller did not detect any reads supporting the variant. So the question is not whether there were reads supporting the variant, but why did CalculateGenotypePosteriors call the variant as a heterozygous despite there not being any variants supporting it.
-
- Can you share the vcf record for this location from the Haplotypecaller output.
- Can you zoom into the variant location in IGV and send a screenshot of that? I am unable to figure out where the variant is from this image.
-
GVCF HaplotypeCaller records. Note that the indel was not called in sample 2, but was clearly called in Sample 3. CalculateGenotypePosteriors for some reason changed the genotype call in Sample 2.
Sample 2:
chr12 49051247 . C <NON_REF> . . END=49051247 GT:DP:GQ:MIN_DP:PL 0/0:37:96:37:0,96,1025
chr12 49051248 . T <NON_REF> . . END=49051248 GT:DP:GQ:MIN_DP:PL 0/0:32:67:32:0,67,833
chr12 49051249 . C <NON_REF> . . END=49051249 GT:DP:GQ:MIN_DP:PL 0/0:40:80:40:0,80,1028Sample3:
chr12 49051248 . TC *,T,<NON_REF> 37.26 . AS_RAW_BaseQRankSum=|||;AS_RAW_MQ=0.00|3600.00|7200.00|7200.00;AS_RAW_MQRankSum=|||;AS_RAW_ReadPosRankSum=|||;AS_SB_TABLE=0,0|1,0|1,1|2,0;DP=6;ExcessHet=3.0103;MLEAC=0,1,0;MLEAF=0,0.5,0;RAW_MQandDP=21600,6 GT:AD:DP:GQ:PL:SB 2/2:0,1,2,2:5:6:49,55,131,6,6,0,57,98,9,92:0,0,4,1
Zoomed in screenshot of the indel is attached.
-
The original genotype and GQ can change based on the genotype posteriors calculated by the CalculateGenotypePosteriors tool. Can you share the vcf records for this location from the Haplotypecaller output for all 3 samples. And from the CalculateGenotypePosteriors from all 3 samples. It is a little difficult to tell what exactly is happening here without all the information, hence the questions asking for more info.
-
Here is the GVCF for each of the 3 samples in order, when HaplotypeCaller is run on each sample individually:
chr12 49051248 . T <NON_REF> . . END=49051248 GT:DP:GQ:MIN_DP:PL 0/0:27:0:27:0,0,333
chr12 49051248 . T <NON_REF> . . END=49051248 GT:DP:GQ:MIN_DP:PL 0/0:32:67:32:0,67,833
chr12 49051248 . TC *,T,<NON_REF> 37.26 . AS_RAW_BaseQRankSum=|||;AS_RAW_MQ=0.00|3600.00|7200.00|7200.00;AS_RAW_MQRankSum=|||;AS_RAW_ReadPosRankSum=|||;AS_SB_TABLE=0,0|1,0|1,1|2,0;DP=6;ExcessHet=3.0103;MLEAC=0,1,0;MLEAF=0,0.5,0;RAW_MQandDP=21600,6 GT:AD:DP:GQ:PL:SB 2/2:0,1,2,2:5:6:49,55,131,6,6,0,57,98,9,92:0,0,4,1Here is the VCF when HaplotypeCaller is run jointly on all 3 samples:
chr12 49051248 . TC T 35.67 . AC=2;AF=0.333;AN=6;BaseQRankSum=-0.204;DP=20;ExcessHet=0.4576;FS=0;MLEAC=1;MLEAF=0.167;MQ=59.42;MQRankSum=0.612;QD=17.84;ReadPosRankSum=-1.762;SOR=0.693 GT:AD:DP:GQ:PL 0/0:9,0:9:27:0,27,375 0/0:3,0:3:9:0,9,119 1/1:0,2:2:6:49,6,0
And pasted from my above prior post is what CalculateGenotypePosterior outputs when running on the HaplotypeCaller VCF of the jointly called 3 samples):
chr12 49051248 . TC T 31.42 PASS AC=2;AF=0.333;AN=6;BaseQRankSum=-0.204;DP=18;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.167;MQ=59.35;MQRankSum=0.612;PG=0,0,0;QD=15.71;ReadPosRankSum=-1.762;SOR=0.693 GT:AD:DP:FT:GQ:JL:JP:PL:PP 0/0:9,0:9:lowGQ:17:0:14:0,27,374:0,17,364 0/1:3,0:3:lowGQ:18:0:14:0,9,119:18,0,110 0/1:0,2:2:PASS:31:0:14:40,0,16:31,0,43
While I think it is good to dig into these details, the bottom line is that there is a definite bug in CalculateGenotypePosteriors. The prior details don't matter so much I think, because there is no reason that CalculateGenotypePosteriors should call a samples as heterozygous when there are 0 reads supporting the ALT allele. That should be impossible.
Let me know if there is any other info you need.
Thanks.
-
Note: We seemed to have missed adding the suggestion mentioned below in our documentation and we are doing that as I write this :)
The right way (best practices) to use the supporting panel is to set this argument in CalculateGenotypePosterior when using supporting panels: --num-reference-samples-if-no-call = number of samples in your panel. This will help when the supporting panel has no variant at a site and it will infer that the samples should be hom ref at that location.
As descried in the documentation of this argument: When a variant is not seen in a panel, this argument controls whether to infer (and with what effective strength) that only reference alleles were observed at that site. E.g. "If not seen in 1000Genomes, treat it as AC=0, AN=2000", where AN=2*nSamples for human autosomes. -
Hi,
That still doesn't help fix the problem.
Here is the output after direct joint calling with Haplotype caller:
chr12 49051248 . TC T 35.67 . AC=2;AF=0.333;AN=6;BaseQRankSum=-0.204;DP=20;ExcessHet=0.4576;FS=0;MLEAC=1;MLEAF=0.167;MQ=59.42;MQRankSum=0.612;QD=17.84;ReadPosRankSum=-1.762;SOR=0.693 GT:AD:DP:GQ:PL 0/0:9,0:9:27:0,27,375 0/0:3,0:3:9:0,9,119 1/1:0,2:2:6:49,6,0
Here is the output after CalculateGenotypePosterior without the --num-reference-samples-if-no-call option, using the gnomAD reference (af-only-gnomad.hg38.vcf.gz) that shows a false heterozygous call in the second sample.
chr12 49051248 . TC T 35.67 . AC=2;AF=0.333;AN=6;BaseQRankSum=-0.204;DP=20;ExcessHet=0.4576;FS=0;MLEAC=1;MLEAF=0.167;MQ=59.42;MQRankSum=0.612;PG=0,0,0;QD=17.84;ReadPosRankSum=-1.762;SOR=0.693 GT:AD:DP:GQ:JL:JP:PL:PP 0/0:9,0:9:16:0:13:0,27,375:0,16,364 0/1:3,0:3:18:0:13:0,9,119:18,0,110 0/1:0,2:2:21:0:13:49,6,0:34,0,21
Here is the output after CalculateGenotypePosterior with the --num-reference-samples-if-no-call option set to 20314 (# of samples in that gnomAD reference) that now is homozygous reference for all 3 samples:
chr12 49051248 . TC T 35.67 . AC=0;AF=0;AN=6;BaseQRankSum=-0.204;DP=20;ExcessHet=0.4576;FS=0;MLEAC=1;MLEAF=0.167;MQ=59.42;MQRankSum=0.612;PG=0,43,89;QD=17.84;ReadPosRankSum=-1.762;SOR=0.693 GT:AD:DP:GQ:JL:JP:PL:PP 0/0:9,0:9:59:0:13:0,27,375:0,59,453 0/0:3,0:3:25:0:13:0,9,119:0,25,181 0/0:0,2:2:9:0:13:49,6,0:0,9,76
So the results are unstable and swinging between false positive to false negative. Variant calling for such a variant should be more stable. The variant is extremely clear in the raw data in IGV.
What do you advise?
Thanks.
-
HI GE
I have 2 more questions for you. Please bear with me and I will hopefully have a solution/workaround for you soon.
- Are all the steps from variant calling, joint calling and CalculateGenotypePosterior run on just the three samples you have shared above?
- Can you please share your ped file?
-
Thanks for all your efforts.
1. Yes
2. Ped file:
Family1 Sample1 0 0 1 1
Family1 Sample2 0 0 2 1
Family1 Sample3 Sample1 Sample2 1 2 -
Hi GE
Apologies for the delay, I was on vacation. We are looking into this and will have an update for you soon.
-
Hi G E
We looked into it and we think that CalculateGenotypePosterior is doing what it is supposed to here.
In the case where the offspring has het var, there is one mendelian violation. Due to the low coverage of reads in that region, what the CalculateGenotypePosterior tool is saying is that the probability of having only seen the ref reads and not the alt reads in the low coverage region is more than there actually being a de-novo mutation. And therefore, the mother’s GT gets changed to eliminate the de novo in the first case.
In the second case where the offspring is hom var, there are two mendelian violations - so the mother’s and offspring’s GTs are changed to eliminate the two de novos.
This is how the CalculateGenotypePosteriors is designed and is an intended outcome of the tool. It uses the information from the reads and pedigree to determine what the most probable genotype is.
I hope this explanation was helpful to you. -
Hi Bhanu, Thanks for the follow-up.
However, while I appreciate the Bayesian argument here, and the fact that it is more likely to be a "false negative dropout" in the mother than a de novo variant, which is why it changes the mother's genotype, on principle I do not think GATK should ever call a variant in a sample when there are 0 reads supporting that genotype.
In other words, there should be a hard stop on Bayesian inference when it leads to this situation. GATK should never invoke variants that do not exist, just to fit some statistical model. In this case there were 0 reads supporting the genotype in the mother, and therefore there should be no circumstances whatsoever for GATK to call the variant in the mother.
Perhaps it is a philosophical issue, but I think there must be at least 1 read to support any genotype call, regardless of Bayesian inference. De novo variants are a major source of human disease, and not assuring this basic logic this will lead to missed calls, either because both parent and proband are both made to be het or both made to be REF by GATK.
-
Thanks for sharing GE! We will take that under consideration.
Please sign in to leave a comment.
14 comments