This post was written in collaboration with Severine Catreux (Illumina DRAGEN team), James Emery (Broad Institute GATK team) and members of the GATK support team.
Last September we announced a new collaboration with the DRAGEN team at Illumina to co-develop analysis methods and pipelines. If you missed the news at the time, you can read the original blog post which explains the big picture context and goals of the collaboration. Right now, we're working together to produce a unified DRAGEN-GATK pipeline for short-read variant calling that will be available both as a free open-source version (via Broad) and as a licensed hardware-accelerated version (via Illumina).
Today I want to talk about some of the algorithm improvements that were developed by the DRAGEN team, which our team is currently porting to the open-source GATK package.
If you know anything about DRAGEN, it's that it's super fast
Under the hood, DRAGEN uses a type of hardware called FPGAs (Field-Programmable Gate Arrays) to deliver phenomenal speed-ups to their GATK-based germline short variant discovery pipeline. Without going into the details, this reduces the end-to-end runtime from over 23 hours down to about 22 minutes on average for a single whole genome sample, starting from unmapped reads and delivering a GVCF and/or a filtered VCF. Needless to say, that is very cool. To be clear though, we don't anticipate that the work we're doing together is going to substantially accelerate the software-only version of the tools that we distribute as GATK. If you have a need for speed on that order of magnitude, you're going to want to check out Illumina's commercial DRAGEN offering, which includes the hardware-accelerated version. But what this collaboration does offer is to make available some specific accuracy improvements in the software-only open-source version (labeled DRAGEN-GATK). The good news: the outputs of the two versions (pure software vs hardware-boosted) will be functionally equivalent and compatible for cross-dataset analyses.
The other side of the DRAGEN coin is its outstanding accuracy
I'll admit, accuracy is the thing I always worry about when I hear someone has accelerated our tools. Why do I worry? First, reimplementing complex methods is tricky. Some details can get lost in translation between programming languages, and even an experienced developer might mistakenly dismiss some piece of code as inefficient and inessential without realizing that it actually plays an important role in dealing with difficult edge cases. Connected to that, validation is hard; you can't just look at one high-quality sequencing run of one sample and declare that everything works fine. You have to run on a range of different datasets that exhibit a variety of quirks and potential quality issues and verify that you can still handle all the weird corner cases.
One of the strengths of the DRAGEN team was that they understood this from the get-go, and used a wide range of datasets spanning subjects, library preparations, sequencers, read lengths, and coverage depths to ensure that their implementations were robust against a wide range of sequencing conditions. What's more, they didn't just aim to match the accuracy of the original GATK tools; they actually went out of their way to identify opportunities to improve on it. And that's where the story gets really interesting.
After analyzing enough pileups of false negative and false positive calls of datasets with curated truth sets, the DRAGEN team found two areas of opportunity that were ripe for the picking: they decided to improve the indel error model to cope with PCR-induced stutter in STR regions, and refine the genotyping model by taking into account the presence of correlated pileup errors. Let's go over each of those briefly.
New indel error model accounts for PCR-induced stutter in STR regions
The key challenge in variant calling is distinguishing various types of errors (due to contamination, library preparation artifacts, sequencing artifacts, mismapping) from true variants. Indels are particularly difficult to call reliably because indel errors are more likely in the presence of short tandem repeats (STRs), and the error probability may depend on both the period and the length of the STR. Like many other variant callers, HaplotypeCaller uses a Hidden Markov Model (HMM) to model the statistical behavior of errors as part of the probability calculation, with predetermined functions that are not sample-specific.
The DRAGEN team found that they could model the indel error process more accurately by adding sample-specific logic prior to variant calling, and taking STR period and length explicitly into account within a new method called DragSTR. The sample-specific logic is an auto-calibration step that runs on the BAM input to characterize a-priori probabilities of indel errors and indel variants. In their testing, this resulted in better detection performance, both in terms of sensitivity and precision: they recovered calls that were previously missed and removed false positives that were previously called.
Detecting correlated pileup errors improves genotyping accuracy
Like most variant callers, HaplotypeCaller assumes that sequencing and/or mapping errors are independent across reads. Following this assumption, it’s very unlikely that multiple identical errors would occur at the same locus. However, it's been observed that bursts of errors are far more common than would be predicted by the independence assumption, and that these bursts can result in lots of false positives.
The classic approach to mitigating this problem is to set up a checkpoint upstream of the genotyper to filter out reads that fail to meet certain quality thresholds, for example if their MAPQ is too low. However, this is a suboptimal way to approach the problem, since it hides valuable evidence from the variant caller. That can cause the caller to miss real variants, and fails to suppress many false positives.
The DRAGEN team developed a different approach for addressing the problem, which consists of two main pieces: 1) equip the genotyper to characterize correlated pileup errors and introduce new genotyping hypotheses and 2) with the genotyper enhancements in place, open up those checkpoint gates upstream and accept previously rejected reads into the genotyping process. Once again, they recovered calls that were previously missed and removed false positives that were previously called.
As an interesting corollary, the DRAGEN team found that these improvements to the genotyping allowed them to simplify the filtering that still remains to be done after variant calling, in part because introducing a correlated noise error model within the genotyper improved the accuracy of the QUAL score, bringing it within a range of realistic values. As a result, DRAGEN has a very simple hard-filtering rule based on the QUAL score only.
The GATK team's evaluation confirmed these accuracy gains
Our team uses evaluation methods that are a bit different from what the DRAGEN team uses, so we wanted to do our own evaluation to see how these proposed gains would translate in our testing framework. We ran the DRAGEN pipeline ourselves on our favorite test samples using a server loaned by Illumina (finally got a taste of that famous acceleration -- yeah it is fast, no kidding) and long story short, we saw the same overall improvements in sensitivity and specificity, which you can see for yourself in the figure below.
Briefly, the figure shows the precision and sensitivity for 19 technical replicates (different sequencing runs) of the NA12878 sample compared to the Genome In A Bottle (GIAB) truth set (high confidence regions). You can see the very clear gains in indel and heterozygous SNP calling precision of the DRAGEN pipeline (blue) compared to our original GATK 184.108.40.206 single-genome pipeline with 2D CNN filtering (orange), as well as a modest but always welcome increase in indel sensitivity. Nice!
Catching up to the hardware-accelerated DRAGEN-GATK
Given those results, we've determined that the DRAGEN pipeline we tested (v 3.4) is the standard that we want to aim for, so we consider it to be the hardware-accelerated version 1.0 of the DRAGEN-GATK pipeline for germline short variants. In other words, if you've been running DRAGEN 3.4, congratulations, you've been running the DRAGEN-GATK pipeline this whole time :)
Our team is now working hard to port the improvements developed by the DRAGEN team into our open-source software-only version, with the goal of making it functionally equivalent -- meaning you could run either version and get the same results, within a margin of error that we consider inconsequential. The porting work mainly involves modifications to the HaplotypeCaller, which will be released as part of the regular GATK software package. The full software-only DRAGEN-GATK pipeline will be released as a WDL workflow script in the gatk-workflows collection on GitHub.
Once that work is done, you'll be able to combine samples analyzed by either pipeline into downstream analyses without having to worry about batch effects. That's a big deal, especially given the massive datasets that are cropping up all over the place these days. Needless to say, no-one wants to reprocess 100,000 samples just for kicks.
As you can probably tell I'm very excited by the progress of this partnership, which is already delivering clear benefits just a few months in. Stay tuned for more updates as we finish up the current work -- and get ready for the next wave, because there's more in store after that.