This post was written in collaboration with Severine Catreux (Illumina DRAGEN team), James Emery (Broad Institute GATK team) and members of the GATK support team.
Last September we announced a new collaboration with the DRAGEN team at Illumina to co-develop analysis methods and pipelines. If you missed the news at the time, you can read the original blog post which explains the big picture context and goals of the collaboration. Right now, we're working together to produce a unified DRAGEN-GATK pipeline for short-read variant calling that will be available both as a free open-source version (via Broad) and as a licensed hardware-accelerated version (via Illumina).
Today I want to talk about some of the algorithm improvements that were developed by the DRAGEN team, which our team is currently porting to the open-source GATK package.
If you know anything about DRAGEN, it's that it's super fast
Under the hood, DRAGEN uses a type of hardware called FPGAs (Field-Programmable Gate Arrays) to deliver phenomenal speed-ups to their GATK-based germline short variant discovery pipeline. Without going into the details, this reduces the end-to-end runtime from over 23 hours down to about 22 minutes on average for a single whole genome sample, starting from unmapped reads and delivering a GVCF and/or a filtered VCF. Needless to say, that is very cool. To be clear though, we don't anticipate that the work we're doing together is going to substantially accelerate the software-only version of the tools that we distribute as GATK. If you have a need for speed on that order of magnitude, you're going to want to check out Illumina's commercial DRAGEN offering, which includes the hardware-accelerated version. But what this collaboration does offer is to make available some specific accuracy improvements in the software-only open-source version (labeled DRAGEN-GATK). The good news: the outputs of the two versions (pure software vs hardware-boosted) will be functionally equivalent and compatible for cross-dataset analyses.
The other side of the DRAGEN coin is its outstanding accuracy
I'll admit, accuracy is the thing I always worry about when I hear someone has accelerated our tools. Why do I worry? First, reimplementing complex methods is tricky. Some details can get lost in translation between programming languages, and even an experienced developer might mistakenly dismiss some piece of code as inefficient and inessential without realizing that it actually plays an important role in dealing with difficult edge cases. Connected to that, validation is hard; you can't just look at one high-quality sequencing run of one sample and declare that everything works fine. You have to run on a range of different datasets that exhibit a variety of quirks and potential quality issues and verify that you can still handle all the weird corner cases.
One of the strengths of the DRAGEN team was that they understood this from the get-go, and used a wide range of datasets spanning subjects, library preparations, sequencers, read lengths, and coverage depths to ensure that their implementations were robust against a wide range of sequencing conditions. What's more, they didn't just aim to match the accuracy of the original GATK tools; they actually went out of their way to identify opportunities to improve on it. And that's where the story gets really interesting.
After analyzing enough pileups of false negative and false positive calls of datasets with curated truth sets, the DRAGEN team found two areas of opportunity that were ripe for the picking: they decided to improve the indel error model to cope with PCR-induced stutter in STR regions, and refine the genotyping model by taking into account the presence of correlated pileup errors. Let's go over each of those briefly.
New indel error model accounts for PCR-induced stutter in STR regions
The key challenge in variant calling is distinguishing various types of errors (due to contamination, library preparation artifacts, sequencing artifacts, mismapping) from true variants. Indels are particularly difficult to call reliably because indel errors are more likely in the presence of short tandem repeats (STRs), and the error probability may depend on both the period and the length of the STR. Like many other variant callers, HaplotypeCaller uses a Hidden Markov Model (HMM) to model the statistical behavior of errors as part of the probability calculation, with predetermined functions that are not sample-specific.
The DRAGEN team found that they could model the indel error process more accurately by adding sample-specific logic prior to variant calling, and taking STR period and length explicitly into account within a new method called DragSTR. The sample-specific logic is an auto-calibration step that runs on the BAM input to characterize a-priori probabilities of indel errors and indel variants. In their testing, this resulted in better detection performance, both in terms of sensitivity and precision: they recovered calls that were previously missed and removed false positives that were previously called.
Detecting correlated pileup errors improves genotyping accuracy
Like most variant callers, HaplotypeCaller assumes that sequencing and/or mapping errors are independent across reads. Following this assumption, it’s very unlikely that multiple identical errors would occur at the same locus. However, it's been observed that bursts of errors are far more common than would be predicted by the independence assumption, and that these bursts can result in lots of false positives.
The classic approach to mitigating this problem is to set up a checkpoint upstream of the genotyper to filter out reads that fail to meet certain quality thresholds, for example if their MAPQ is too low. However, this is a suboptimal way to approach the problem, since it hides valuable evidence from the variant caller. That can cause the caller to miss real variants, and fails to suppress many false positives.
The DRAGEN team developed a different approach for addressing the problem, which consists of two main pieces: 1) equip the genotyper to characterize correlated pileup errors and introduce new genotyping hypotheses and 2) with the genotyper enhancements in place, open up those checkpoint gates upstream and accept previously rejected reads into the genotyping process. Once again, they recovered calls that were previously missed and removed false positives that were previously called.
As an interesting corollary, the DRAGEN team found that these improvements to the genotyping allowed them to simplify the filtering that still remains to be done after variant calling, in part because introducing a correlated noise error model within the genotyper improved the accuracy of the QUAL score, bringing it within a range of realistic values. As a result, DRAGEN has a very simple hard-filtering rule based on the QUAL score only.
The GATK team's evaluation confirmed these accuracy gains
Our team uses evaluation methods that are a bit different from what the DRAGEN team uses, so we wanted to do our own evaluation to see how these proposed gains would translate in our testing framework. We ran the DRAGEN pipeline ourselves on our favorite test samples using a server loaned by Illumina (finally got a taste of that famous acceleration -- yeah it is fast, no kidding) and long story short, we saw the same overall improvements in sensitivity and specificity, which you can see for yourself in the figure below.
Briefly, the figure shows the precision and sensitivity for 19 technical replicates (different sequencing runs) of the NA12878 sample compared to the Genome In A Bottle (GIAB) truth set (high confidence regions). You can see the very clear gains in indel and heterozygous SNP calling precision of the DRAGEN pipeline (blue) compared to our original GATK 126.96.36.199 single-genome pipeline with 2D CNN filtering (orange), as well as a modest but always welcome increase in indel sensitivity. Nice!
Catching up to the hardware-accelerated DRAGEN-GATK
Given those results, we've determined that the DRAGEN pipeline we tested (v 3.4) is the standard that we want to aim for, so we consider it to be the hardware-accelerated version 1.0 of the DRAGEN-GATK pipeline for germline short variants. In other words, if you've been running DRAGEN 3.4, congratulations, you've been running the DRAGEN-GATK pipeline this whole time :)
Our team is now working hard to port the improvements developed by the DRAGEN team into our open-source software-only version, with the goal of making it functionally equivalent -- meaning you could run either version and get the same results, within a margin of error that we consider inconsequential. The porting work mainly involves modifications to the HaplotypeCaller, which will be released as part of the regular GATK software package. The full software-only DRAGEN-GATK pipeline will be released as a WDL workflow script in the gatk-workflows collection on GitHub.
Once that work is done, you'll be able to combine samples analyzed by either pipeline into downstream analyses without having to worry about batch effects. That's a big deal, especially given the massive datasets that are cropping up all over the place these days. Needless to say, no-one wants to reprocess 100,000 samples just for kicks.
As you can probably tell I'm very excited by the progress of this partnership, which is already delivering clear benefits just a few months in. Stay tuned for more updates as we finish up the current work -- and get ready for the next wave, because there's more in store after that.
Is this single sample calling only? Is there a 2ish step process still involved (creating a gvcf like file) and then joint call?
Hi Kurt, this applies to both the single sample case and the multisample (joint calling) cases. Here we show the evaluation results from the single sample case because they're more straightforward to interpret, but the modifications we describe will benefit joint calling as well. So if you're calling variants on cohorts, you will still do the 2-step process of creating a GVCF for each sample separately then joint calling all your samples together.
Is there a plan for improving and merging Mutect for both the Dragen accelerated and non-accelerated pipelines?
What about duplicated genes?
To my knowledge, GATK does not call variants if reads are mapped to more than one locus on the genome; e.g. we never see variants of SMN1 and SMN2 genes in our annotated variants list.
Does the new DRAGEN-GATK pipeline have a solution to this challenge?
MICHAEL MCMANUS Mutect2 and the somatic short variants pipeline are on the list of use cases we want to work on together, but we haven't yet decided which will be next after the germline short variants. We'd love to hear from you all on what would be most valuable to the research community, so don't hesitate to comment.
Shahryar Alavi You're correct that GATK will only take into account uniquely mapped reads; dealing with duplicated genes is more a matter of mapping strategy. The updated DRAGEN-GATK pipeline does not change anything compared to the current behavior in that respect.
FYI our colleagues in the Genomics Platform (who do the sequencing at the Broad Institute) are presenting a poster at the AGBT meeting this week about their testing of the DRAGEN-GATK pipeline. For the lucky ducks over there, it's poster 1010 and will be presented on Tuesday Feb 25th from 4:40 pm to 6:10pm. For the rest of the world, here is the PDF: https://drive.google.com/open?id=1fa0RG9RjoPOBlqvPoH6CXL121NldqAph
We have been playing with Dragen on an FPGA AWS instance a little. For the figure above, we would be curious for your hard filter settings for DRAGEN (based on my reading of this blog this sounds like it is just quality score based) and for GATK (this might have more parameters owing to the CNN) so that we can recreate it. This would be helpful to benchmark and give us a baseline when we test the two with our own whole genomes.
Dragen is screaming fast by the way, coupled with FPGA it absolutely tears through large WGS samples. This feels like a big step toward making the processing into the thousands of whole genomes more routine. Probably also opens the door on processing data for truly massive studies. Personally, I am excited for the applications to cancer, gwas, etc that this will enable. Thanks again.
Hi Niko Balanis, glad to hear you're enjoying that blazing speed :) Let me loop in Bhanu Gandham to find out if we have the evaluation parameters in a shareable format. I'm pretty sure it's in the public github somewhere but hopefully we can get you a direct pointer so you don't have to go digging for it. When the full pipeline is ready for release we'll have a public workspace on Terra that will exercise the whole pipeline on test data to make it fully reproducible -- but I assume you'd prefer to start benchmarking sooner if possible.
Hi Niko Balanis
Very glad and excited to hear that you are already getting started with this! Here is the script that was used to filter the GATK variants: https://github.com/broadinstitute/gatk/blob/master/scripts/cnn_variant_wdl/cram2filtered.wdl . You are right, for DRAGEN, the variants were filtered based on qual scores and the default threshold used there was 10. Please note: the GATK and DRAGEN versions used in this blog are older - GATK v188.8.131.52 and DRAGEN v3.4. Since then there have been new improvements made to both. I would love to hear your thoughts once you are done with benchmarking on your end.
And as Geraldine said, we will soon have a whole pipeline and test data made publicly available. So stay tuned! :)
Hi Geraldine Van der Auwera,
Now that v184.108.40.206 is released, are there any news regarding the DRAGEN-GATK somatic short variants pipeline (Mutect2) implementations? I am using the software-only version extensively and the only way I've found to reduce the long processing time from ~24 hours to ~20 minutes is using "a lot" of intervals and running "a lot" of Mutect2 commands in parallel, which is quite expensive. DRAGEN-GATK might be an excellent cost-effective alternative.
Hi Ury Alon, the next pipeline/use case we are tackling as part of the DRAGEN-GATK collaboration is germline SV (structural variation), which is a big area of focus for us right now. We don't currently have plans to tackle somatic short variants with the DRAGEN team. Sorry to disappoint!
Are you currently using our WDL workflows to operationalize your analyses, and are you running on cloud? We have some optimizations in place to reduce cost that may help if you're not already using those.
Thanks for the prompt response, Geraldine Van der Auwera.
We are not using actual WDLs, as we use AWS Step Functions and Batch for pipeline orchestration and execution. We have done significant optimizations, but I would love to read the methods you suggest.
Ah I see. Unfortunately we don't yet have systematic documentation on the pipeline optimizations, but off the top of my head, the top three are (1) using preemptible VMs on GCP (analogous to Spot instances on AWS) which saves us a ton of money (cost is ~20% of full price, barring preemptions); (2) auto-sizing disk allocations (calculating based on input sizes and output expectations); and (3) streaming data from buckets instead of localizing entire files. You can see the WDL scripts and the summaries of cost per sample in the Mutect2 workspace here: https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4 (it's public after sign-in, which is free/open to all).
The Terra knowledge base has docs on how to use these optimizations (which are not specific to Terra, to be clear), let me know if you're interested in specific pointers. Also there's a general cost control article here that covers a range of considerations; some of them are Google-specific but some are fairly cross-platform.
Thanks Geraldine Van der Auwera, I've been actually reading this link earlier today :)
I don't know if this is the place to get more specific, but I'll give it a try:
In terms of optimizations, we've gone way beyond what you mentioned (will be happy to share): We are using Spot instances exclusively, we use piping wherever possible (which unfortunately is not in all cases), and we use parallelization using SPARK function variants and by processing only intervals (in the case of Mutect2, for example).
I think we can further reduce processing time and resource consumption by splitting the reference we're using (gh38) to chromosomes and intervals (in Mutect2, once again), but testing it, I saw the results are not the same as when the full reference file is used. Is using only the "relevant" FASTA reference segment (for example, just chr1 when mutecting tumor chr1 and normal chr1) as a reference to Mutect something you would recommend?
Oh yes I didn't mention processing only intervals since you already mentioned doing that, but it is definitely key. I'm not entirely sure I understand what you mean by "using only the "relevant" FASTA reference segment" though -- are you feeding in different reference files corresponding to subsets? On that front, we use the whole (uncut) reference file for everything, though I think we recently started using persistent disks for the reference files to speed up the process and avoid having to copy it all to each VM. I haven't used that yet myself so I can't speak to the details, but there was a feature update on the Terra blog that mentioned this (because it's now used by default when using our built-in references on Terra): https://terra.bio/coming-soon-faster-cheaper-workflows/
The other thing we do wrt intervals is that we make "lists of interval lists". We can then parallelize execution into multiple GATK tasks, each of which runs on its own list of intervals, and we balance the lists so they all take roughly the same amount of time, which avoids eg waiting an extra hour just because one interval list has a lot more ground to cover. We've done quite a lot of tuning to figure out how widely to parallelize to get the balance of cost vs speed that is right for us (recognizing that when you parallelize more widely, you pay more for the overhead of setting up VMs etc) but it's not necessarily one size fits all. We tend to prefer cheap >> super fast for routine processing, but will occasionally dial things up if we need faster turnover at the expense of higher cost.
Great, thanks for clarifying. In our case, processing time is usually prioritized over costs, but we try to have both :)
What I meant is, should there be a difference (in terms of identifying mutations) between executing
Note that the tumor and normal samples are already split into chromosomes in both cases, and that the interval list further limits the extent of processing.
After all Mutect2 tasks are completed, we concatenate the resulting VCF into one.
Got it — that should not make any difference. The only case where things could get weird is if you're splitting up chromosomes internally, because then you could run into edge effects. But if you're simply separating entire chromosomes, it should be fine, so if you do observe differences, I would recommend posting the details in the support forum so the support team can take a look.
Awesome, thank you very much Geraldine Van der Auwera. I'll get to it :)
Are there/where there any advances on getting GenomicsDbImport to have the ability to run in parallel, or is GenomicsDbImport just limited to a single core still?
Hi Jonathan Klonowski,
Actually you can use GNU parallel to wrap the GenomicdDBImport for parallel running.
Hi Geraldine Van der Auwera,
I wonder, since the new added mode of dragen-mode for HC(which should be used on dragmap generated BAM files I suppose), can we mix the GVCFs generated by the old, traditional HC and the new dragen-mode HC in GenomicsDBImport and GenotypeGVCFs?
do you have any idea when a gatk-dragen pipeline will be released? Or at least the different steps and parameters to use to get results at least as good as with the classic gatk pipeline?
in particular does it work if we use bwa mem 2? Because I tried dragmap and so far it's way too slow.
moreover after having generated the gvcf with haplotypecaller in dragen mode we can do joint genotyping with genomicsdbimports followed by genotypegvcf?
I read that with dragen-gatk it was enough for a hard filtering but with what value if it is just a qual <10 there are no additional steps after genotypegvcf (which already eliminates the variants lower than 10)?
as you can see I really want to use these new tools ;-)
Great work. Is there a white paper or manuscript with the validation data? Thanks, Seth F.
I try this New pipeline and it's really good. But two point :
First, dragmap is very long (16h vs 7h for bwa mem2 for the same sample ans ressources).
Second, it's not clear for me if after single sample germline with dragen we should still use joint genotyping, and if yes dragen hard filtering is ok for that ?
Thanks a lot.
Other question :
The dragen mode of haplotypecaller can (should ?) Be use for variant calling on rna seq data ?
Where can I find the WDL and JSON files for DRAGEN-GATK Germline analysis v1 to be used on an HPC?
Thanks for your help
Hello, I also have the above problem. In the GVCF file, ALT column is all "NON_REF", but in VCF, it is normal. Is there any reason for this
Please sign in to leave a comment.