Variant not called by HaplotypeCaller (not called in father, but present in child)
Hi,
I have a bam file where an expected variant (21:34923904C>T) is not called by HaplotypeCaller.
I have followed the advice from https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant:
* When creating the GVCF (with `-ERC GVCF`), indeed GQ=0 is revealed:
21 34923904 . C <NON_REF> . . END=34923904 GT:DP:GQ:MIN_DP:PL 0/0:141:0:141:0,0,234
* When calling with `--linked-de-bruijn-graph`, the the variant `21:34923904C>CTCCATGGACTCCCAGATGTTAGCAACTAGT` is called:
21 34923904 . C CTCCATGGACTCCCAGATGTTAGCAACTAGT 1492.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.312;DP=177;ExcessHet=3.0103;FS=4.938;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=8.73;ReadPosRankSum=-5.261;SOR=0.903 GT:AD:DP:GQ:PL 0/1:63,108:171:99:1500,0,1461
* Combining `--linked-de-bruijn-graph` and `-ERC GVCF` reveals:
21 34923904 . C T,CTCCATGGACTCCCAGATGTTAGCAACTAGT,<NON_REF> 1492.60 . BaseQRankSum=0.312;DP=177;ExcessHet=3.0103;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQRankSum=0.000;RAW_MQandDP=637200,177;ReadPosRankSum=-5.261 GT:AD:DP:GQ:PL:SB 0/2:63,0,42,0:105:51:1500,51,1520,0,1456,1461,982,1713,1678,2483:29,34,24,18
(with debug output:
14:01:34.906 INFO EventMap - === Best Haplotypes === 14:01:34.906 INFO EventMap - GATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGCACTATGGACTCCCAGATGTTAGCAACCAGTTCCATGGACTCCCAGATGTTAGCAACCAGCTCCATGGACTCCCAG
14:01:34.906 INFO EventMap - > Cigar = 151M 14:01:34.906 INFO EventMap - >> Events = EventMap{} 14:01:34.906 INFO EventMap - GATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGTACTATGGACTCCCAGATGTTAGCAACCAGTTCCATGGACTCCCAGATGTTAGCAACCAGCTCCATGGACTCCCAG 14:01:34.906 INFO EventMap - > Cigar = 151M
14:01:34.906 INFO EventMap - >> Events = EventMap{21:34923904-34923904 [C*, T],}
14:01:34.906 INFO EventMap - GATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGCTCCATGGACTCCCAGATGTTAGCAACTAGTACTATGGACTCCCAGATGTTAGCAACCAGTTCCATGGACTCCCAGATGTTAGCAACCAGCTCCATGGACTCCCAG
14:01:34.906 INFO EventMap - > Cigar = 76M30I75M 14:01:34.906 INFO EventMap - >> Events = EventMap{21:34923904-34923904 [C*, CTCCATGGACTCCCAGATGTTAGCAACTAGT],}
14:01:34.907 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 34923904 with alleles = [C*, T, CTCCATGGACTCCCAGATGTTAGCAACTAGT]
)
I think this is a true variant though because I have the child (index) of this sample and here the caller "correctly" identifies the variant (the variant would be het in the father and I assume it is not de novo in the child):
# Index GVCF:
21 34923904 . C T,<NON_REF> 2445.64 . BaseQRankSum=0.017;DP=281;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-2.304;RAW_MQandDP=1004722,281;ReadPosRankSum=-0.466 GT:AD:DP:GQ:PL:SB 0/1:158,120,0:278:99:2453,0,3540,2924,3893,6818:79,79,62,58
# Index VCF:
21 34923904 . C T 2445.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.017;DP=281;ExcessHet=3.0103;FS=0.918;MLEAC=1;MLEAF=0.500;MQ=59.80;MQRankSum=-2.304;QD=8.80;ReadPosRankSum=-0.466;SOR=0.761 GT:AD:DP:GQ:PL 0/1:158,120:278:99:2453,0,3540
The docs on `--linked-de-bruijn-graph` say: "If enabled, the Assembly Engine will construct a Linked De Bruijn graph to recover better haplotypes".
I assume that `--linked-de-bruijn-graph` is turned off by default for a reason.
Would it be possible to elaborate a little bit on this?
Furthermore, does anyone have a comment on my specific example here?
In IGV, the variant can be clearly seen in the index but not in the father. OTOH the father has the variant 21:34924243>T>C more than 300bps downstream (which my be completely unrelated). The father's coverate at 34923904 is ~140x and the index has ~300x according to IGV.
a) GATK version used:
I used 4.1.9.0 as well as 4.5.0.0
b) Exact command used:
#ARGS="--debug -ERC GVCF "
#ARGS="--debug -ERC GVCF --linked-de-bruijn-graph"
GATK_4190=~/scratch/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
java -jar "$GATK_4190" HaplotypeCaller -R "$GRCH37_REF_FASTA" -I ./index.bam --output index.vcf -L ./in.bed -bamout index-bamout.bam $ARGS
c) Entire program log:
Provided as needed
-
Hello Rolf Schröder. I'm sorry to hear about this issue with HaplotypeCaller. It seems like you have already found exactly the link with advice that we usually deploy in these situations. Looking at the Father output with LinkedDebrujin graph the 32 base insertion that was called for the father looks like it is perhaps some sort of failure or pathological behavior for the more complicated graph. It is difficult to tell without closer inspection/debugging exactly what happened here, its possible that is a pathological site where assembly has indeed failed and that the LinkedDebrujinGraph is also failing there.
To answer your question about LinkedDebrujinGraph. It is a more complicated graph alogrithm that fixes many faults in the existing graph but also introduces some faults of its own and while we think its better (it is part of the standard practice for Mutect2 now) we are conservative about introducing defaults that will introduce unexpected behavior to the HaplotypeCaller by default.
There are a number of other possible solutions to recover this event. First the most likely issue is that there is a repetitiveness in the reference at that site that is causing both assembly graphs to struggle (though the fact that it worked for the child somewhat complicates this story). First you could try providing a VCF file for that site with the `--alleles` argument to force the haplotype caller to assemble and call that variant. This will bypass any issues with assembly and catch the variant. Along a similar stream, in GATK 4.5.0.0 and on we have introduced a new `--pileup-detection` and associated arguments that supplements the assembly engine by looking at the pileups for SNPs and indels that assembly might have failed on. This is a use at your own risk mode however as it does increase the number of false positives called and has been optimized for use in the DRAGEN-GATK pipeline.
You could try calling using the DRAGEN-GATK caller mode `--dragen-378-concordance-mode` which changes a lot of settings and adds some new geneotying arguments to emulate the dragen algorithms for better calling. Among the changes in that mode are our recommended and tested pileup-caller arguments. -
Hey James,
thank you very much for that answer. In the meantime, I had also looked for repetitive sequences and realized that this might be the culprint. Here are my results for now. I will test your proposals for completeness, though.
# These sequences are in that genomic area
# Here is the expectec C>T change
# |
CTCCATGGACTCCCAGATGTTAGCAACTAGC (2x)
CTCCATGGACTCCCAGATGTTAGCAAC (5x)
CTCCATGGACTCCCAGATGTTAGCAACCAGC (3x)
# |
# C-or-T -
Hi James,
I tested 4.5.0.0 once again as you suggested. When running with `--alleles` using the VCF from the child, the variant is revealed:
21 34923904 . C T 1457.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.341;DP=137;ExcessHet=0.0000;FS=3.117;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=10.88;ReadPosRankSum=0.458;SOR=0.443 GT:AD:DP:GQ:PL 0/1:67,67:134:99:1465,0,1562
Using `--pileup-detection` (without `--alleles`) also reveals it (with exactly the same results). Finally, using `--dragen-378-concordance-mode`, the following results are yield:
21 34923904 . C T 39.23 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.085;DP=144;ExcessHet=0.0000;FS=2.967;MLEAC=1;MLEAF=0.500;MQ=59.62;MQRankSum=1.380;QD=0.27;ReadPosRankSum=0.535;SOR=0.484 GT:AD:DP:GP:GQ:PG:PL 0/1:74,69:143:39.23,0,70:39:0,34.77,64.77:74,0,40
Thank you very much again for your detailed answers. I understand that progress is ongoing!
Best,
Rolf
Please sign in to leave a comment.
3 comments