HaplotypeCaller Local Assembly Algorithm
Dear all,
I've recently been looking into the HaplotypeCaller local assembly algorithm. After reading the related documentation, I found the following:
- It fails to assemble when an active region contains tens of consecutive adjacent variants.
- It fails to assemble when an active region is repeat-rich (such regions are skipped because they may cause cycles in the graph).
Does anybody know why assembly fails in these cases, and how the assembler could be improved?
Thanks in advance,
Jacky
-
Hi Jacky Xia. The HaplotypeCaller assembly engine is a somewhat complex topic, and there are many specific, messy reasons it can fail to assemble. There is some general advice for debugging and mitigating missing variants, much of which pertains to graph assembly failures, here: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant. Specifically, the `--linked-de-bruijn-graph` argument described there is likely to help HaplotypeCaller in these messy, repetitive regions. `--recover-all-dangling-branches` is likely to help in regions with many adjacent variants, as those sites very often produce disconnected graphs that do not adhere nicely to the reference path. If you wait for the next release of GATK (or are willing to try our nightly builds or build the current master branch of GATK), we have added a new and improved `--pileup-detection` argument that sidesteps assembly and falls back to the read pileups to supplement assembly variants at sites where assembly might have failed to find the correct alleles.
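For anyone who wants to try these options on a single problem region, here is a minimal Python sketch that only builds the command line. The helper name `haplotypecaller_command` and all file names are hypothetical placeholders. The linked-graph flag is written as `--linked-de-bruijn-graph`, which is how I understand it to be spelled in the GATK tool documentation; please check `gatk HaplotypeCaller --help` for your version. `--bam-output` is included as an optional way to inspect the assembled haplotypes.

```python
def haplotypecaller_command(reference, bam, output_vcf, interval=None,
                            linked_graph=False, recover_dangling=False,
                            debug_bam=None):
    """Build (but do not run) a HaplotypeCaller argument list.

    All paths are placeholders supplied by the caller; this helper is
    illustrative and not part of GATK.
    """
    cmd = ["gatk", "HaplotypeCaller",
           "-R", reference,
           "-I", bam,
           "-O", output_vcf]
    if interval:
        # Restrict the run to the region where assembly is suspected to fail.
        cmd += ["-L", interval]
    if linked_graph:
        # Intended for repetitive regions that produce loops in the graph.
        cmd.append("--linked-de-bruijn-graph")
    if recover_dangling:
        # Intended for regions with many adjacent variants.
        cmd.append("--recover-all-dangling-branches")
    if debug_bam:
        # Write the assembled haplotypes and realigned reads for inspection.
        cmd += ["--bam-output", debug_bam]
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs; replace with real files before running the command.
    print(" ".join(haplotypecaller_command(
        "ref.fasta", "sample.bam", "out.vcf.gz",
        interval="chr1:1000-2000", linked_graph=True,
        debug_bam="debug.bam")))
```

Running one option at a time on a small interval makes it easier to see which setting changes the calls at a given site.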
To give a little more insight into why your two cases might be failing:
- The de Bruijn graph in HaplotypeCaller expects, and tries to read out, paths from the start to the end of the assembly window. At sites with many closely spaced variants, you can very often end up with long strings of kmers in the graph that do not cleanly reconnect to the reference path, because their variants make them mismatch the reference too much. We attempt to recover these sorts of paths (often called "dangling ends" if they are anchored to the reference on one side) by realigning them to the reference and forming a valid path again. However, this process is prone to failure, and we are somewhat conservative in what we allow to be recovered at those sites. The dangling-end argument mentioned above (`--recover-all-dangling-branches`) might help you in these cases, as it makes that code somewhat more robust to errors.
- You have the right idea about graph cycles in this case. The HaplotypeCaller assembly engine fails if there is a loop in the graph and tries again at a higher kmer size. This can either resolve the issue or cause variants to disappear due to errors in the kmers in the reads, and there is no guarantee that the engine can assemble a graph without loops. The Linked De Bruijn Graph approach we added to the engine was an attempt to account for loops as well as possible given the reads that we have, but if the region is truly very repetitive, it might not be possible to reassemble it in the engine at all, even with this code.
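To make the loop problem concrete, here is a small Python sketch under simplifying assumptions. It is not the GATK implementation: it uses kmers as graph nodes, ignores read errors and pruning, and the function names and kmer sizes are made up for illustration. It shows how a tandem repeat creates a cycle at small kmer sizes and how retrying with a larger kmer can remove it.

```python
from collections import defaultdict


def build_graph(sequences, k):
    """Map each k-mer to the set of k-mers that directly follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph


def has_cycle(graph):
    """Iterative depth-first search; a back edge to an open node is a cycle."""
    unvisited, open_, done = 0, 1, 2
    state = defaultdict(int)
    for start in list(graph):
        if state[start] != unvisited:
            continue
        state[start] = open_
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if state[nxt] == open_:
                    return True
                if state[nxt] == unvisited:
                    state[nxt] = open_
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: close this node.
                state[node] = done
                stack.pop()
    return False


def smallest_acyclic_k(sequences, kmer_sizes):
    """Return the first k whose graph has no cycle, or None if all fail."""
    for k in kmer_sizes:
        if not has_cycle(build_graph(sequences, k)):
            return k
    return None


if __name__ == "__main__":
    # Toy "reference" containing a CAG tandem repeat.
    repeat_region = "TTACG" + "CAG" * 4 + "GTCAA"
    # Illustrative kmer sizes only, not GATK defaults.
    print(smallest_acyclic_k([repeat_region], [5, 10, 15]))
```

In this toy case the repeat is short enough that a larger kmer spans it. When the repeat is longer than any usable kmer, every attempt returns a cyclic graph, which corresponds to the situation described above where the region cannot be assembled.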
-
Thank you, James Emery, for the kind reply, which has been very helpful. Following your advice, I re-ran gatk-4.3.0 with `--linked-de-bruijn-graph` enabled, but the results were worse than before. Specifically, it produced more false positive and more false negative variants. I would like to know what could be causing this.
Best regards,
Jacky
-
Hello Jacky Xia. We generally don't expect the Linked De Bruijn mode to increase the number of false negatives. We know that it often increases the number of false positives, because it generally increases the number of sites on the reference that get called, which often means that artifacts at those sites are now returned when they were previously "filtered" for the wrong reasons. There are other assembly graph tweaks to try (namely `--recover-all-dangling-branches`), but they also carry the risk of increasing the number of false positives: they very often cause messy sites to emit results where before there were none. Ultimately, you will have to examine your filtering and decide on the trade-off between sensitivity and specificity for your specific application.