It's finally here! We've been talking about it (on the old, now archived blog!), and talking about it (with a misguidedly optimistic timeline), and talking about it (with actual technical details)… And now, I'm happy to announce that version 1.0 of the open-source DRAGEN-GATK pipeline is fully baked.
Over the course of all that time, we collaborated with the DRAGEN team at Illumina to port a number of algorithmic improvements from DRAGEN into the GATK codebase, which we released as part of GATK 4.2. However, the key DRAGEN-based improvements were not turned on by default; instead, we made them available through optional arguments, as explained in the GATK 4.2 release highlights. We did that in part because at the time, the overall pipeline was not completely finalized yet, and in part so that when the pipeline was final, you wouldn't have to upgrade to DRAGEN-GATK until you were good and ready — but you would still have access to general improvements, bug fixes and so on in the latest versions of GATK.
With the recent release of the DRAGMAP aligner, which replaces BWA MEM for read alignment, the DRAGEN-GATK pipeline is now complete! Accordingly, we updated the public WGS Germline Analysis workflow that our pipelines team uses in production (running all the steps from read alignment to per-sample variant calling, i.e. uBAM to GVCF), to include a "DRAGEN-GATK" mode that activates the optional DRAGEN-based features, including using DRAGMAP for read alignment. This workflow is intended to serve as a reference implementation, and constitutes the official release of open-source DRAGEN-GATK version 1.0.
Why yes, my sequences are aligned. Unfortunately, they're chaotic neutral.
Try out DRAGEN-GATK today
The workflow script is available through the Broad Institute's WARP repository on GitHub, along with detailed technical documentation about its various steps and configuration options. (Note that the workflow has its own version number, not related to the DRAGEN-GATK version, in the same way that GATK tools also follow their own versioning system.) There is also an accompanying document that provides summary descriptions of the methods, which you're welcome to use for writing the relevant methods section of any paper(s) that use this pipeline.
For your convenience, we also make the DRAGEN-GATK reference workflow available through Terra, our preferred cloud-based analysis platform, which is co-developed by the Broad Institute, Microsoft and Verily. You can browse the DRAGEN-GATK workspace to see how the workflow is configured to run in DRAGEN mode on a sample dataset, or clone the workspace to try out the pipeline for yourself without having to install anything, as shown in the accompanying tutorial video.
To learn more, tune in to our webinar tomorrow, Thursday, Dec 2 at 10 AM ET. We'll talk about the new methods and show a live demo of DRAGEN-GATK in Terra. Feel free to post the questions you'd like to see addressed during the webinar in this section of the GATK forum (which includes the Zoom link). The webinar will be recorded, so even if you can't attend, you can still get answers to your most burning questions.
Before you go, though, there are a couple of important options I'd love to tell you about if you have 5 more minutes to read on.
Important workflow configuration options
The default settings of the DRAGEN-GATK configuration of the workflow produce results that are functionally equivalent to those output by hardware DRAGEN version 3.4.12. That means that callsets produced by either our open-source DRAGEN-GATK pipeline or Illumina's proprietary DRAGEN platform (run in the corresponding DRAGEN-GATK compatible mode) can be combined for joint analysis without having to worry about batch effects. This was a key objective of the DRAGEN-GATK project, and we're very pleased to have achieved it.
Nevertheless, we included a few additional options to give you some flexibility in case your objectives and constraints don't entirely match ours, while still conforming to the GATK Best Practices.
Falling back on BWA MEM instead of using DRAGMAP
One very important option allows you to switch the choice of aligner from DRAGMAP to BWA MEM. DRAGMAP is the only aligner that we've found to produce full functional equivalence with hardware DRAGEN, and it achieved the best accuracy in our testing, but we have found that you can produce results that are nearly as accurate with BWA MEM if you use the same masked reference (we'll have more on that later). If functional equivalence with hardware DRAGEN-processed data is not your top priority, BWA MEM is still a perfectly valid choice of aligner.
On this topic, it's important to note that we did all of our testing using the specific hg38 reference genome provided by the DRAGEN team, which uses masking to resolve some difficult situations arising from a subset of alternate contigs. We have not tested the DRAGEN-GATK pipeline on other human genome reference builds, nor on genomes of other organisms. We expect the improvements to the variant calling steps should apply equally to other organisms, but much of the improved mapping produced by the current version of DRAGMAP appears to be specific to the masked hg38 reference, so this is another case where you may prefer to stick with BWA MEM for now.
Activating "maximum quality mode"
We've worked hard over the past couple of years to combine the algorithmic strengths of both hardware DRAGEN and open-source GATK into a single pipeline, yet there remain a few minor features on either side that have not yet made it into the other codebase.
On the GATK side, HaplotypeCaller's genotyping function includes some logic to better handle "spanning events", where there is a deletion overlapping another variant. The default configuration of the open-source DRAGEN-GATK pipeline disables this feature in order to achieve full functional equivalence with hardware DRAGEN. We've found that if we override this and enable spanning event genotyping as originally designed, we can gain a little bit of additional quality in the results, at the cost of failing the functional equivalence test. The quality difference is relatively minor, but if you don't care about achieving functional equivalence with hardware DRAGEN, you may find it worth your while to re-enable this feature.
For the sake of convenience, and to make the workflow relatively robust to future developments, the DRAGEN-GATK workflow includes an optional "dragen_maximum_quality_mode" argument that enables you to do that without having to touch the GATK options directly. You can see this in action in the DRAGEN-GATK workspace in Terra, which demonstrates both the default "Functional Equivalence" configuration and the alternative "Maximum Quality" configuration. You can also learn more about these two modes in the pipeline documentation.
Excerpt from the "Maximum Quality" workflow configuration in the DRAGEN-GATK workspace in Terra, showing the functional equivalence mode and maximum quality mode settings. These two options are mutually exclusive; if one is set to "true" (mode activated), the other must be set to "false" (mode deactivated).
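If you prepare your workflow inputs outside of Terra, a small helper can guard against setting both modes at once. What follows is a minimal sketch, not the official configuration: only "dragen_maximum_quality_mode" is named in this post, while the workflow prefix and the functional equivalence key are assumptions made for illustration, so check the pipeline documentation for the exact input names before relying on them.

```python
import json

# Only "dragen_maximum_quality_mode" is named in this post. The workflow
# prefix and the functional equivalence key are ASSUMED for illustration.
WORKFLOW = "WholeGenomeGermlineSingleSample"  # assumed workflow name
FE_KEY = f"{WORKFLOW}.dragen_functional_equivalence_mode"  # assumed key
MQ_KEY = f"{WORKFLOW}.dragen_maximum_quality_mode"

VALID_MODES = ("functional_equivalence", "maximum_quality")


def build_mode_inputs(mode: str) -> dict:
    """Return both mode flags, with exactly one of them set to True."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        FE_KEY: mode == "functional_equivalence",
        MQ_KEY: mode == "maximum_quality",
    }


def check_modes(inputs: dict) -> None:
    """Raise ValueError if both mode flags are true at the same time."""
    if inputs.get(FE_KEY, False) and inputs.get(MQ_KEY, False):
        raise ValueError(
            "functional equivalence and maximum quality modes "
            "are mutually exclusive"
        )


if __name__ == "__main__":
    inputs = build_mode_inputs("maximum_quality")
    check_modes(inputs)
    # Print a JSON fragment that could be merged into a larger inputs file.
    print(json.dumps(inputs, indent=2))
```

Running the script prints a two-key JSON fragment with the maximum quality flag set to true and the other set to false. Calling check_modes on a hand-edited inputs dictionary catches the case where both flags were switched on by mistake.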
Once again, we encourage you to join the webinar tomorrow or watch the recording afterward, as we will be going over all this in more detail, with a live demo and a Q&A session to address questions posted in this section of the GATK forum (which includes the Zoom link) and in the webinar chat.
1 comment
Any medium term plans to support multi-genome graphs reference (-hg38, grch37, hg19) pre-built hash tables? Thanks.