As you may know from previous posts, we’ve been working with the DRAGEN team at Illumina to produce a new version of our Best Practices pipeline for germline short variant calling, called DRAGEN-GATK, that combines key strengths of our respective tools.
Over the past year, our teams collaborated closely to make improvements to both DRAGEN and the open-source GATK software, including some changes to HaplotypeCaller that increase its accuracy. These improvements were released as part of GATK version 4.2, but they haven't been integrated into our Best Practices workflows yet.
That is about to change, as we are preparing to release updated workflows that will mark the official release of version 1.0 of the open-source DRAGEN-GATK. We'll go over the full details of what this release entails in a webinar and Q&A session on Thursday, Dec. 2 from 10AM-11AM EST at this Zoom link.
In the meantime, we wanted to tell you more about a key component of the new pipeline — DRAGMAP — which is set to replace BWA-MEM as the default genome mapper in the joint DRAGEN-GATK pipeline.
Alpha release of open-source DRAGMAP
In a nutshell, DRAGMAP is an open-source software implementation of the DRAGEN mapper, which the Illumina team created so that we would have an open-source way to produce the same results as their proprietary DRAGEN hardware. This was necessary because the original DRAGEN mapper is encoded in the DRAGEN hardware architecture itself, which uses FPGA technology. Until now, it didn't exist as a piece of software that you could download and install on your own computer.
The Illumina team has released an alpha version of DRAGMAP on GitHub under a GPLv3 license. Since this is an alpha release, it does come with a couple of caveats:
- The DRAGEN team has been prioritizing accuracy and functional equivalence, so they haven't put a lot of emphasis on speed. At this time, the runtime of the software version of DRAGMAP is expected to be roughly on par with BWA-MEM's. (See the runtime expectations at the end of this post.)
- This is a brand new software codebase, and as with any new software, bugs are possible. As people test it out on a variety of datasets and in different computing environments, the team will use any reported issues and feedback to refine the software accordingly.
With that out of the way, let's have a look at some of the questions that should naturally arise from replacing such a major piece of our pipeline.
Why write a new genome mapper?
Our teams include computational biology veterans who have been using BWA since it was first released in 2009, and up until now it has been our preferred genome mapper. We're all very aware of the enormous debt of gratitude that our field owes to Heng Li for creating BWA; in its various forms, it has been run on millions of exomes and genomes, probably the most of any genome mapper so far.
Yet, as data generation technology evolves, and countless research groups worldwide keep pushing the envelope of the field with new methods, resources, and research questions, there comes a time when even a workhorse like BWA reaches its limits. We've simply arrived at a point where we need new tools to tackle the next generation of challenges.
DRAGMAP builds upon the core ideas that have sustained BWA, but with some innovations that will enable it to address those new challenges in a future-forward way.
Which brings us to the inevitable question…
How does DRAGMAP compare to BWA-MEM?
We've previously noted that the DRAGEN-GATK pipeline produces more accurate results than our previously established Best Practices pipeline, which uses BWA-MEM for mapping. Let's look at the results of a more recent test, in which we compared variant calling results derived from BAM alignments made with BWA-MEM and with DRAGMAP, respectively, against an hg38 reference, using the same commands and parameters for the variant calling step.
This "ROC curve" plot shows the tradeoff between false positives on the X-axis (aka, "how many false positives are we letting in") and sensitivity on the Y-axis (aka, "how many real variants are we finding"), for SNP callsets made from data aligned with BWA-MEM (in blue) and DRAGMAP (in orange).
If you’re not used to looking at this kind of plot, the main thing you need to know is that you want your ROC curve to hug the axes and come as close to the top left corner as possible, since that's where you can maximize sensitivity while minimizing the number of false positives you have to accept. In the above plot, we see that the DRAGMAP-aligned callset contains consistently fewer false positive SNPs than the BWA-MEM-aligned callset, at every possible level of sensitivity.
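If you'd like to see how the points on a curve like this can be computed, here is a minimal Python sketch. It is not the code we used to produce the plot. It assumes each call has already been labeled as a true or false positive against a truth set, and the function name `roc_points`, the scores, and the labels are all made up for illustration.

```python
from typing import List, Tuple


def roc_points(calls: List[Tuple[float, bool]], n_truth: int) -> List[Tuple[float, int, float]]:
    """Return (score_threshold, false_positive_count, sensitivity) points.

    calls   -- hypothetical (quality_score, is_true_positive) pairs, one per call
    n_truth -- total number of variants in the truth set
    """
    if n_truth <= 0:
        raise ValueError("n_truth must be positive")

    # Walk from the strictest threshold to the most permissive one.
    ordered = sorted(calls, key=lambda call: call[0], reverse=True)
    points = []
    true_pos = 0
    false_pos = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Accept every call whose score equals the current threshold.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((threshold, false_pos, true_pos / n_truth))
    return points


# Toy data: five calls, four variants in the truth set.
toy_calls = [(50.0, True), (40.0, True), (30.0, False), (20.0, True), (10.0, False)]
for threshold, fp, sensitivity in roc_points(toy_calls, n_truth=4):
    print(f"score >= {threshold:5.1f}  FP = {fp}  sensitivity = {sensitivity:.2f}")
```

Each point answers the question "if I keep every call at or above this score, how many false positives do I accept, and what fraction of the real variants do I find?" A better callset reaches the same sensitivity with fewer false positives, which is what pushes its curve toward the top left corner.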
So, where are all those false positives in the BWA-MEM-aligned callset coming from, and how does DRAGMAP do it better?
A reference masking approach to resolve alt contig alignment problems
When we dig into the BWA-MEM alignments, we find that many of those false positive calls fall in regions of the genome where there are alt contigs that share a lot of sequence with the primary contigs. This poses a challenge for the genome mapper, which has to decide whether it makes more sense to align reads to the primary contig or to the alt contig.
DRAGMAP uses a reference masking strategy to resolve this challenge, and all our tests have shown this to produce superior accuracy over previous approaches, with fewer false positives in downstream variant calling. The Illumina team has published an article about this reference masking approach on Illumina's Genomics Research blog.
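To give a rough idea of what "masking" a reference means in general, here is a toy Python sketch that overwrites chosen intervals of a sequence with `N` so that reads can no longer align there. This is only an illustration of the general idea, not DRAGMAP's implementation. The function name `mask_regions`, the coordinates, and the use of `N` as the mask character are assumptions made for this example; how the real masked regions are chosen is described in Illumina's article.

```python
from typing import Iterable, Tuple


def mask_regions(sequence: str, regions: Iterable[Tuple[int, int]], mask_char: str = "N") -> str:
    """Return a copy of `sequence` with each region replaced by `mask_char`.

    Regions are hypothetical 0-based, half-open (start, end) intervals,
    the same convention used by BED files.
    """
    bases = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(bases):
            raise ValueError(f"region ({start}, {end}) is outside the sequence")
        # Overwrite the interval in place; the sequence length is unchanged,
        # so coordinates elsewhere on the contig stay valid.
        bases[start:end] = mask_char * (end - start)
    return "".join(bases)


# Toy alt contig where positions 2 to 4 are assumed to duplicate primary sequence.
alt_contig = "ACGTACGTAC"
print(mask_regions(alt_contig, [(2, 5)]))  # ACNNNCGTAC
```

Because the masked stretch no longer matches any read, a mapper working from such a reference has only one place left to put reads from that sequence, rather than having to choose between two near-identical candidates.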
Note that DRAGMAP testing has focused on available samples from the Genome in a Bottle project, and we have not specifically evaluated this approach on genomes from populations that exhibit greater diversity, so there may be as-yet unrecognized limitations here. We have also not done any testing on non-human genomes.
One of the most promising approaches for meaningfully improving results for diverse populations may be the use of graph genomes. The latest hardware version of DRAGEN already supports using a graph genome for mapping, which contributed to DRAGEN's winning performance in the precisionFDA Truth Challenge V2 on Illumina reads. Porting graph support to DRAGMAP will be a big undertaking, and was therefore out of scope for the very first software version of DRAGMAP, but it will be considered for future DRAGMAP versions. We’ll keep you posted on any developments.
For now, we encourage you to check out this very first, brand-spanking-new alpha version of DRAGMAP by visiting the GitHub repository and taking it for a spin.
If you do, please let us know how it goes in the forum; though if you run into any trouble, you'll want to post directly on the DRAGMAP GitHub issues page to reach the Illumina development team.
Upcoming DRAGEN-GATK Webinar
Also, remember to join us for the upcoming DRAGEN-GATK webinar and Q&A session on Thursday, Dec. 2, 2021, from 10AM-11AM EST at this Zoom link.
We'll also be taking questions prior to the webinar in the Webinar Discussion section of the forum, so please let us know what you want to know!
Appendix: Runtime expectations
This first version of DRAGMAP has not been optimized for speed. Here are the current runtime expectations for common configurations and workloads, using BWA-MEM performance as a point of comparison.
[Table: DRAGMAP vs. BWA-MEM runtime with 80 threads]
[Table: DRAGMAP vs. BWA-MEM runtime with 16 threads]
Edit 1: (2021-12-01) The benchmark ROC plot was updated to reflect the latest comparison between DRAGMAP and BWA-MEM (the previous image resulted from an intermediate development version and was used in error).
Edit 2: (2021-12-09) Added a link to Illumina's blog post explaining the versions of GRCh38/hg38 reference genomes and how they are used in DRAGEN.
Reference masking is not an innovation but a necessity, and it has been a well-known practice for years. GIAB released v2 for hg38. Shortcomings quasi-solved by DRAGMAP cannot be called innovation. I am wondering what a comparison would show with a reference genome like hs37d5, where none of the maskable regions apply.