If you know only one thing about NVIDIA, it's probably that they manufacture popular graphics cards for computers. They started out targeting their hardware at gaming machines, branding themselves as “the way it’s meant to be played”.
However, for those of you who know two things about NVIDIA, chances are that you're also aware of the company's reputation as an innovator in machine learning and AI, which they've been applying to a number of fields, from self-driving cars to healthcare and the life sciences — including genomics.
In fact, we've had the pleasure of working with NVIDIA on a couple of projects over the past year, and we're excited to bring you some news about the tools and features that are coming out of this collaboration.
NVIDIA's most widely known contribution to genomics so far is Clara Parabricks, a suite of GPU-accelerated genomic analysis workflows based on the GATK Best Practices. It is capable of analyzing a whole genome in under 25 minutes, and exomes in 4 minutes, all while saving costs. Those of you who follow the Terra blog will have already seen that the Clara Parabricks workflows are now available in a Terra workspace, making them widely available to researchers on the cloud. That was a big step forward, so please make sure to check out the Terra blog post about this workspace for more info!
Today, we are proud to introduce another output of our ongoing collaboration with NVIDIA: Say hello to NVScoreVariants, a new deep learning tool for filtering variants using convolutional neural networks (CNN), which delivers key improvements over the "old" CNNScoreVariants tool.
New look, same great taste
For calling germline short variants from single samples, GATK provides a deep learning method that uses convolutional neural networks (CNN) to generate variant quality scores. These scores can then be used for filtering, i.e. separating true variants from likely artifacts (see "Main steps for Germline Single-Sample Data" in the Germline short variants Best Practices documentation).
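To make the idea of score-based filtering concrete, here is a minimal Python sketch that splits scored variants into kept and rejected sets. It is only an illustration, not how GATK performs the filtering step (see the Best Practices documentation for that). It assumes the score is stored in an INFO key named CNN_1D, and the cutoff of 0.0, the function names, and the example records are all hypothetical.

```python
# Sketch: split scored variants into kept/rejected by a score threshold.
# Assumptions (not from the post): the score lives in an INFO key named
# CNN_1D, and the 0.0 cutoff is arbitrary, chosen only for illustration.

def info_score(info_field, key="CNN_1D"):
    """Return the float stored under `key` in a VCF INFO string, or None."""
    for entry in info_field.split(";"):
        name, _, value = entry.partition("=")
        if name == key and value:
            return float(value)
    return None


def split_by_score(vcf_lines, threshold=0.0, key="CNN_1D"):
    """Partition VCF data lines into (kept, rejected) using the score."""
    kept, rejected = [], []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        score = info_score(fields[7], key)  # INFO is the 8th VCF column
        if score is not None and score >= threshold:
            kept.append(line)
        else:
            rejected.append(line)  # low score, or no score at all
    return kept, rejected


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\t.\tDP=30;CNN_1D=2.5",
    "chr1\t2000\t.\tC\tT\t20\t.\tDP=8;CNN_1D=-4.1",
]
kept, rejected = split_by_score(example, threshold=0.0)
print(len(kept), len(rejected))
```

A fixed cutoff like this is the simplest possible rule; a real pipeline would choose its threshold against truth data rather than hard-coding one.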
Until now, the tool responsible for scoring variants using a pre-trained convolutional neural network was the very aptly-named CNNScoreVariants. The new tool contributed by NVIDIA, NVScoreVariants, is a drop-in replacement for CNNScoreVariants. On the surface, NVScoreVariants does the same thing that CNNScoreVariants does — in fact, the results you'll get from the new tool have been optimized to be functionally equivalent to CNNScoreVariants, and performance is comparable between the two tools.
Under the hood, however, there are some important differences that will benefit your work going forward. The original GATK-CNN filtering tool, CNNScoreVariants, uses the TensorFlow machine learning framework. In contrast, NVScoreVariants uses the PyTorch framework.
Moving from TensorFlow to PyTorch has a few key advantages. First, PyTorch can run natively in GATK's built-in Python environment, so it causes fewer installation headaches if you're not using our Docker container. It’s also more lightweight and faster to develop for. That gives us more room to further optimize the tool in the future, specifically to run at higher speed (so it will scale better when applied to large datasets).
Finally, compared with TensorFlow, PyTorch has better debugging capabilities, active development, and outstanding community support. This ultimately means that the more adventurous among you can modify or calibrate NVScoreVariants to develop new tools for your specific problems and datasets, even without extensive programming and machine-learning expertise!
To be clear, this first release of NVScoreVariants does not aim to improve performance over CNNScoreVariants. Instead, it is a complete modernization of an older tool that opens the door for faster development and optimization. This means better, faster, and more reliable results in future releases.
Take it for a test drive
As always, we encourage you to try our new GATK tools out for yourself, to see if they are worth a place in your pipeline.
NVScoreVariants is integrated into the latest release of GATK, so if you haven’t updated yet, let this be the reason! Below, you can also find various ways to access the tool:
- Docker image: A GATK snapshot suitable for running NVScoreVariants is available on Docker Hub at broadinstitute/gatk-dev:NVSCOREVARIANTS-PREVIEW-SNAPSHOT. The source code that this image was built from is available on GitHub in the "NVScoreVariants-Preview" repository.
- Docker commands: You can also download and run the image using the following Docker commands. On startup, you will see a message explaining that the image is intended only for running NVScoreVariants and has a custom Python environment set up specifically for that tool. This is normal.
docker pull broadinstitute/gatk-dev:NVSCOREVARIANTS-PREVIEW-SNAPSHOT
docker run -it broadinstitute/gatk-dev:NVSCOREVARIANTS-PREVIEW-SNAPSHOT
- Example commands: Here is an example GATK command for running NVScoreVariants with the 1D model within the image:
./gatk NVScoreVariants -V input.vcf -R reference.fasta -O 1Dout.vcf
For comparison, here is an example GATK command for running NVScoreVariants with the 2D model within the image:
./gatk NVScoreVariants -V input.vcf -R reference.fasta --tensor-type read_tensor -I reads.bam -O 2Dout.vcf
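The only difference between the two commands above is that the 2D model also needs the reads and the read_tensor tensor type. If you script your runs, that difference can be captured in one small helper. The following Python sketch is just an illustration: the helper name and file names are hypothetical, and it uses only the flags shown in the example commands.

```python
# Sketch: assemble the NVScoreVariants command line for the 1D or 2D model.
# Only the flags from the example commands above are used; the helper name
# and file names are hypothetical.

def build_nvscorevariants_cmd(vcf, reference, output, reads=None, gatk="./gatk"):
    """Return the argument list for a 1D run, or a 2D run if `reads` is given."""
    cmd = [gatk, "NVScoreVariants", "-V", vcf, "-R", reference]
    if reads is not None:
        # The 2D model works on read tensors, so it also needs the BAM file.
        cmd += ["--tensor-type", "read_tensor", "-I", reads]
    cmd += ["-O", output]
    return cmd


cmd_1d = build_nvscorevariants_cmd("input.vcf", "reference.fasta", "1Dout.vcf")
cmd_2d = build_nvscorevariants_cmd(
    "input.vcf", "reference.fasta", "2Dout.vcf", reads="reads.bam"
)
print(" ".join(cmd_1d))
print(" ".join(cmd_2d))
```

Inside the container, either list could be passed to `subprocess.run(cmd, check=True)` to launch the tool.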
You can find more information about NVScoreVariants and the other changes we’ve made to GATK by reading our Release Notes for the most recent release.
Finally, this is just one small slice of a larger collaboration between the Broad Institute and NVIDIA that aims to enable breakthroughs in the way we understand disease, develop diagnostics and deliver treatments (see the video overview of the partnership for more details). So while you are test-driving NVScoreVariants, please stay tuned for more information about our work together in the future!