Variant not called with HaplotypeCaller
AnsweredHi,
I'm using GATK Haplotype Caller version 4.1.8.0 for germline sample (I also tried 4.2.0.0 but I have the same problem). I used only required options.
I'm looking for a real variant which is present on the alignment but which is not called by GATK. This variant is located in a complex region, it is a substitution of the first base of a ~30 bp homopolymer. Alignments are correct on R1 and variant is present at 50 % but on R2, reads are soft clipped and the variant is often part of the clipped bases.
Can you help me to find a way to call this variant which is real and pathogenic.
Thanks for your help
Flora
-
Hello flora ponelle,
Check out some of our resources that can help figure out why you are seeing this:
-
When HaplotypeCaller and Mutect2 do not call an expected variant
-
Check out the HaplotypeCaller option -bamout to view realigned reads
Make sure that you are following our best practices including data pre-processing and variant calling workflows.
Best,
Genevieve
-
-
Dear Genevieve Brandt,
Thanks for your answer. To begin with, I want to confirm I am following your best practices for variant calling and pre-processing data.
I restarted my analysiswith the following options:
--bamout (in which my variant is present)
--debug
My expected variant is chr2:g.47641560A>T but with debug option I see that it is found at multiple positions due to the homopolymer left alignment which generates deletions of several sizes.
09:21:33.922 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641559 with alleles = [TAAAAAAAAA*, T, TAAAA, TAAAAA, TAAAAAA, TAAAAAAA, TAAAAAAAA]
09:21:33.938 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641560 with alleles = [A*, *, T]
09:21:33.945 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641561 with alleles = [A*, *, T]
09:21:33.950 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641562 with alleles = [A*, *, T]
09:21:33.952 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641563 with alleles = [A*, *, T]
09:21:33.955 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641564 with alleles = [A*, *, T]
09:21:33.958 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641565 with alleles = [A*, *, T]
09:21:33.960 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 47641569 with alleles = [A*, T]Do you have any ideas for parameters that I can change? I have tried different sizes of kmer but the variant is still not called.. -
flora ponelle Could you share the details of this site in the GVCF? And also the HaplotypeCaller command you are running?
-
Hello Genevieve,
The command I usually use is:
gatk HaplotypeCaller
-R myref.fasta \
-I mybam.bam \
-O myvcf.vcf \
--dbsnp dbsnp.vcf \
-L mybed.bed\
--max-reads-per-alignment-start 0 \
--interval-padding 50 \
--minimum-mapping-quality 50 \
-G StandardAnnotation \
--max-mnp-distance 2Here is the site in GVCF:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 2010L0142-A_S29
chr2 47639454 . G <NON_REF> . . END=47639769 GT:DP:GQ:MIN_DP:PL 0/0:843:99:108:0,120,1800
chr2 47641309 . G <NON_REF> . . END=47641558 GT:DP:GQ:MIN_DP:PL 0/0:1709:99:181:0,120,1800
chr2 47641559 rs768805063;rs779102258;rs780636964;rs1057519253;rs1397795100;rs1408193254;rs752529320;rs886056135;rs11309117 TAAAAAAAAAAAA T,TAAA,TAAAAAAA,TAAAAAAAA,TAAAAAAAAA,TAAAAAAAAAAA,<NON_REF> 336.02 . BaseQRankSum=0.572;DB;DP=236;ExcessHet=3.0103;MLEAC=0,0,0,0,1,1,0;MLEAF=0.00,0.00,0.00,0.00,0.500,0.500,0.00;MQRankSum=-0.385;RAW_MQandDP=846837,236;ReadPosRankSum=1.470 GT:AD:DP:GQ:PL:SB 5/6:7,3,2,2,2,6,12,6:40:74:353,491,3588,531,2547,3302,495,1980,2696,2726,213,1201,1239,1282,1067,96,741,782,808,656,592,106,278,319,294,74,0,130,491,1501,1540,1504,1065,737,291,1389:1,2,19,10
chr2 47641560 rs193922376 A *,T,<NON_REF> 21.04 . BaseQRankSum=-0.431;DB;DP=200;ExcessHet=3.0103;MLEAC=2,0,0;MLEAF=1.00,0.00,0.00;MQRankSum=0.000;RAW_MQandDP=720000,200;ReadPosRankSum=-1.150 GT:AD:DP:GQ:PL:SB 1/1:2,33,2,4:41:31:1087,31,0,1145,174,2247,1203,217,1793,1732:2,0,27,12
chr2 47641561 rs749778569 A *,T,<NON_REF> 0 . BaseQRankSum=3.049;DB;DP=171;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-0.614;RAW_MQandDP=613521,171;ReadPosRankSum=2.194 GT:AD:DP:GQ:PL:SB 0/1:14,22,5,5:46:99:669,0,205,732,235,1992,830,363,1542,1542:10,4,22,10
chr2 47641562 . A *,T,<NON_REF> 0 . BaseQRankSum=0.217;DP=145;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-1.596;RAW_MQandDP=519237,145;ReadPosRankSum=1.482 GT:AD:DP:GQ:PL:SB 0/1:31,16,2,6:55:99:366,0,765,560,705,2190,590,845,1767,1756:26,5,13,11
chr2 47641563 . A *,T,<NON_REF> 0 . BaseQRankSum=2.328;DP=136;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-2.338;RAW_MQandDP=486837,136;ReadPosRankSum=2.657 GT:AD:DP:GQ:PL:SB 0/1:44,10,0,6:60:99:194,0,1326,433,1285,2530,438,1391,2145,2114:36,8,8,8
chr2 47641564 . A *,T,<NON_REF> 0 . BaseQRankSum=1.955;DP=132;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-2.752;RAW_MQandDP=472437,132;ReadPosRankSum=2.833 GT:AD:DP:GQ:PL:SB 0/1:57,8,1,6:72:86:86,0,2084,322,1976,2820,338,2100,2582,2607:48,9,7,8
chr2 47641565 . A *,T,<NON_REF> 0 . BaseQRankSum=2.085;DP=127;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-2.923;RAW_MQandDP=454437,127;ReadPosRankSum=2.924 GT:AD:DP:GQ:PL:SB 0/1:60,6,2,6:74:24:24,0,2586,252,2214,2955,293,2478,2754,2910:51,9,6,8
chr2 47641566 . A <NON_REF> . . END=47641566 GT:DP:GQ:MIN_DP:PL 0/0:94:73:94:0,73,2225
chr2 47641567 . A <NON_REF> . . END=47641567 GT:DP:GQ:MIN_DP:PL 0/0:92:67:92:0,67,2215
chr2 47641568 . A <NON_REF> . . END=47641568 GT:DP:GQ:MIN_DP:PL 0/0:94:73:94:0,73,2248
chr2 47641569 . A <NON_REF> . . END=47641569 GT:DP:GQ:MIN_DP:PL 0/0:91:84:91:0,84,2188
chr2 47641570 . A <NON_REF> . . END=47641585 GT:DP:GQ:MIN_DP:PL 0/0:957:99:707:0,120,1800 -
Thank you flora ponelle! Could you also share what the VCF looks like at this site?
I think you should also try the --linked-de-bruijn-graph and then the --recover-all-dangling-branches argument. This can help in regions that are more complex. Please let me know if either of them help!
One question, is there a reason you are running HaplotypeCaller with --minimum-mapping-quality 50? The default is 10, along with the default read filters that are applied with HaplotypeCaller. Changing this parameter could definitely affect your results.
-
The only variant present in my VCF aroud this site is:
chr2 47641559 . TAAA T 284.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.271;DP=254;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.98;MQRankSum=0.000;QD=5.27;ReadPosRankSum=-1.244;SOR=0.346 GT:AD:DP:GQ:PL 0/1:38,16:54:99:292,0,743
I tried --linked-de-bruijn-graph and then the --recover-all-dangling-branches arguments gut I got tthe same results.
For the mapping quality filter, I'm aware that it could affect our results. In our targeted sequencing design we target genes that have pseudogenes. To be sure to keep only reads aligned on our genes of interest we filter reads that have mapping quality score alignment < 50 even if it means losing read depth. So we apply the same filter at the stage of variant calling to be sure to call only variants on our genes of interest, while being aware that for these regions the information may be incomplete.
-
Okay, thanks for the update. Could you try running HaplotypeCaller in this region without the --minimum-mapping-quality 50 argument? From your IGV screenshots, I can see that the depth at this site is decreasing from 585 to 128 after HaplotypeCaller. It would be interesting to see what this site is like without the extra read filter.
-
On my IGV screenshot the bam file after HaplotyeCaller had already been created without --minimum-mapping-quality 50 argument ( I got lost in all my tests..)
To be sure I just did it again using my usual command with the option --minimum-mapping-quality 50:
and without the option --minimum-mapping-quality 50:
What is surprising is to have more reads (150) with the filter than without (127), but reads seem filtered for another reason
The region after my variant of interest is complex: homopolymer + repeat region.
First read (R1) align relatively well around the variant, on the other hand R2 alignments are of very poor quality due to the homopolymer which is just before the variant:
Does GATK filter the read pairs if one of the two reads alignment is bad?
-
Thanks for testing that. There are default read filters in HaplotypeCaller that you can read about on the Tool Docs page: https://gatk.broadinstitute.org/hc/en-us/articles/360056969012-HaplotypeCaller. The read pair is not filtered out, just the low quality read.
You can see how the reads were filtered out on the stack trace. If you run HaplotypeCaller in just this region, it will let you know how many reads were filtered out from each read filter to get more information.
When you do that, could you also share the GVCF line without --minimum-mapping-quality 50?
-
With minimum-mapping-quality = 10:
07:54:20.545 INFO HaplotypeCaller - 6 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
7 read(s) filtered by: NotSecondaryAlignmentReadFilter
3891 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilterWith minimum-mapping-quality = 50:
07:52:50.964 INFO HaplotypeCaller - 124 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
7 read(s) filtered by: NotSecondaryAlignmentReadFilter
3845 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilterI filter the PCR duplicates in IGV, so the only difference in IGV between my bam files before and after HaplotypeCaller is the filter on the mapping score.The problem is that we filter more reads according to thestack trace when mapping quality filter is set to 50 but in the bam after HaplotypeCaller there are more reads displayed...Also in the recalibrated bam there are 585 reads at the site and 150 in the bam after HaplotypeCaller, so we should find more filtered reads in stack trace ? -
Here is the gvcf without --minimum-mapping-quality 50:
chr2 47641559 rs768805063;rs779102258;rs780636964;rs1057519253;rs1397795100;rs1408193254;rs752529320;rs886056135;rs11309117 TAAAAAAAAAAAA T,TAAA,TAAAAAAA,TAAAAAAAA,TAAAAAAAAA,TAAAAAAAAAAA,<NON_REF> 280.02 . BaseQRankSum=0.125;DB;DP=224;ExcessHet=3.0103;MLEAC=0,0,0,0,1,1,0;MLEAF=0.00,0.00,0.00,0.00,0.500,0.500,0.00;MQRankSum=-0.511;RAW_MQandDP=800233,224;ReadPosRankSum=1.507 GT:AD:DP:GQ:PL:SB 5/6:6,3,2,1,2,7,8,7:36:37:297,401,3184,445,2223,2956,450,1721,2415,2453,148,1001,1043,1106,898,37,595,639,695,533,481,123,272,316,333,89,0,158,497,1422,1782,1783,978,690,382,1677:1,2,16,13
chr2 47641560 rs193922376 A *,T,<NON_REF> 6.52 . BaseQRankSum=-0.712;DB;DP=187;ExcessHet=3.0103;MLEAC=2,0,0;MLEAF=1.00,0.00,0.00;MQRankSum=0.000;RAW_MQandDP=673200,187;ReadPosRankSum=-1.171 GT:AD:DP:GQ:PL:SB 1/1:2,26,2,4:34:16:873,16,0,934,153,1946,981,193,1528,1469:2,0,18,14
chr2 47641561 rs749778569 A *,T,<NON_REF> 0 . BaseQRankSum=2.535;DB;DP=158;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-0.683;RAW_MQandDP=566721,158;ReadPosRankSum=1.111 GT:AD:DP:GQ:PL:SB 0/1:15,18,5,5:43:99:556,0,308,631,293,1785,709,422,1406,1421:11,4,15,13
chr2 47641562 . A *,T,<NON_REF> 0 . BaseQRankSum=-1.027;DP=132;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-1.399;RAW_MQandDP=472437,132;ReadPosRankSum=1.310 GT:AD:DP:GQ:PL:SB 0/1:24,16,2,6:48:99:454,0,579,586,570,1949,626,673,1580,1572:18,6,11,13
chr2 47641563 . A *,T,<NON_REF> 0 . BaseQRankSum=2.804;DP=123;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-2.212;RAW_MQandDP=440037,123;ReadPosRankSum=3.042 GT:AD:DP:GQ:PL:SB 0/1:37,9,0,6:52:99:181,0,1070,399,1055,2191,399,1137,1837,1796:26,11,7,8
chr2 47641564 . A *,T,<NON_REF> 0 . BaseQRankSum=3.365;DP=119;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-2.612;RAW_MQandDP=425637,119;ReadPosRankSum=3.012 GT:AD:DP:GQ:PL:SB 0/1:48,7,1,6:62:86:86,0,1738,299,1638,2443,311,1754,2211,2230:36,12,6,8
chr2 47641565 . A *,T,<NON_REF> 0 . BaseQRankSum=4.095;DP=115;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQRankSum=-3.336;RAW_MQandDP=407833,115;ReadPosRankSum=3.086 GT:AD:DP:GQ:PL:SB 0/1:51,6,1,7:65:54:54,0,2186,269,1876,2597,290,2095,2402,2521:39,12,5,9
chr2 47641566 . A <NON_REF> . . END=47641566 GT:DP:GQ:MIN_DP:PL 0/0:81:34:81:0,34,1719
chr2 47641567 . A <NON_REF> . . END=47641567 GT:DP:GQ:MIN_DP:PL 0/0:79:28:79:0,28,1662
chr2 47641568 . A <NON_REF> . . END=47641568 GT:DP:GQ:MIN_DP:PL 0/0:82:37:82:0,37,1721
chr2 47641569 . A <NON_REF> . . END=47641569 GT:DP:GQ:MIN_DP:PL 0/0:79:48:79:0,48,1670
chr2 47641570 . A <NON_REF> . . END=47641585 GT:DP:GQ:MIN_DP:PL 0/0:1039:99:744:0,120,1800
chr2 47641586 . A <NON_REF> . . END=47641586 GT:DP:GQ:MIN_DP:PL 0/0:652:0:652:0,0,13886
chr2 47641587 . G <NON_REF> . . END=47641587 GT:DP:GQ:MIN_DP:PL 0/0:558:68:558:0,68,12656
chr2 47641588 . G <NON_REF> . . END=47641597 GT:DP:GQ:MIN_DP:PL 0/0:618:99:388:0,107,1800 -
Hi flora ponelle,
Thank you for providing all this information, it is helpful to figure out what HaplotypeCaller is doing at this site. Like you said, this is a complicated site with a homopolymer so GATK can have some difficulties. From the GVCF and VCF examples, it looks like HaplotypeCaller has found more evidence for the deletion at this site and not the A>T variant. The VCF line you showed has a spanning deletion that starts one base away from your site of interest and spans the site:
chr2 47641559 . TAAA T 284.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.271;DP=254;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.98;MQRankSum=0.000;QD=5.27;ReadPosRankSum=-1.244;SOR=0.346 GT:AD:DP:GQ:PL 0/1:38,16:54:99:292,0,743
In all the IGV screenshots there is also considerable evidence for the spanning deletion at this site. I would be curious to know how you are sure that this is a real variant and I also wanted to confirm you are working with diploid human? In the GVCF it looks like at this site there is the most evidence for the deletion over the T allele (GT=1/1 for *).
chr2 47641560 rs193922376 A *,T,<NON_REF> 21.04 . BaseQRankSum=-0.431;DB;DP=200;ExcessHet=3.0103;MLEAC=2,0,0;MLEAF=1.00,0.00,0.00;MQRankSum=0.000;RAW_MQandDP=720000,200;ReadPosRankSum=-1.150 GT:AD:DP:GQ:PL:SB 1/1:2,33,2,4:41:31:1087,31,0,1145,174,2247,1203,217,1793,1732:2,0,27,12
Our support for spanning deletions is fairly new, so if you have tried older versions of GATK that would explain why you might see a different result at this site.
Please let me know if you have further questions regarding this explanation.
Best,
Genevieve
-
Yes we are working with diploid human.
Historically we did Sanger in the laboratory, this is a rare pathogenic mutation that we detected in a small number of samples in Sanger.
Recently we tested by NGS a family member of one of the person with the mutation and we were surprised not to find her. It was by looking specifically at the alignments at this position that I saw that it was present but that HaplotypeCaller did not report it.
In sanger there is no doubt about the presence of the variant, there also appears to be a 3 A heterozygous deletion. The different size of deletion reported are artifacts linked to Illumina sequencing which poorly manages long homopolymers.
In GVCF we can see that the substitution is called in g.47641560 and also to the positions g.47641561, 62 , 63 ,64 and 65. It is the same variant but because of the artifact it is called at different positions depending on the size of the deletion detected in the reads.
By imagining that we can force the right-align instead of left-align I think HaplotypeCaller should be able to call the substitution at a single site.
Do you see what I mean?Here is an example of a non mutated sample:The deletion is also visible but there is no substitution.I found this article which talks about the mutation which is difficult to find in NGS (MSH2 c.942 + 3A> T): https://onlinelibrary.wiley.com/doi/full/10.1002/cac2.12134 -
Hi flora ponelle,
Unfortunately we have not been able to find a good solution for your issue here. These long homopolymer regions are noisy and hard to resolve, especially with Illumina short read data.
We don't have any way to right align the indels as you are suggesting. Code to recover dropped variants also wouldn't help here because the issue is not that the variant is missing, it is an issue with the likelihoods for the deletions.
You may found that our new DRAGEN STR code might help reduce the likelihoods on the homopolymer indels. However, we have not fully released this method yet so we do not have good documentation released and are not fully supporting it until the release. You can read more about DRAGEN-GATK on our website and in the release notes for 4.2.0.0: https://github.com/broadinstitute/gatk/releases/tag/4.2.0.0. I'm sorry we were not able to figure out a better solution for this missing variant.
Best,
Genevieve
Please sign in to leave a comment.
14 comments