GATK can be deployed on high-performance computing (HPC) systems using an HPC batch scheduler. Intel provides a fully integrated solution to help users set up and run GATK workflows on HPC. This solution, or reference architecture, includes a hardware bill of materials, a recommended software stack, and step-by-step install guides. See this link for an overview of the solution.
In addition to this solution, or reference architecture, Intel has partnered with the Broad Institute to integrate a collection of common, compute-intensive kernels used in genomic analysis tools into GATK. The Genomics Kernel Library (GKL) includes AVX-512 optimizations as well as compression and decompression libraries, and is distributed as open source with GATK.
The Intel reference architecture supports other open-source genomics libraries, including Picard, BWA, and Samtools. These tools perform a wide variety of tasks, from sorting and fixing tags to generating recalibration models. Users specify the files to be analyzed, which tools to use, and the order in which the execution engine (Cromwell) performs the tasks, all in Workflow Description Language (WDL) files.
Users with existing HPC systems can follow the instructions below to install and run GATK "bare metal," i.e., without containers or VMs.
Guide to running GATK on a local HPC
We provide here a resource to guide you through the documentation you need to get GATK up and running on your local high-performance computing (HPC) environment.
Intel can provide guidance on hardware reference designs for running secondary analysis. Some basic guidance can be found here. For more information, please have your system administrator contact their Intel representative.
If you want to run GATK on your own system, you’ll need to get acquainted with WDL, a community-driven user-friendly scripting language, and Cromwell, an open-source workflow execution engine that can connect to a variety of different platforms through pluggable backends.
In this guide, you will find instructions and links for installing all prerequisites and tools necessary for running GATK on a system of your choosing:
- Before you start: Assumptions and prerequisites
- Install prerequisites
- Install GATK and non-GATK tools
- Install and configure Cromwell
- Run a sample workflow
- Access GATK WDLs for running on-premises
Before you start: Assumptions and prerequisites
The GATK documentation explains software dependencies for running GATK.
We assume that a job scheduler is already installed. Cromwell supports running GATK on a variety of infrastructure backends, including HPC job schedulers. SLURM is recommended; see the Cromwell docs for a full list of supported HPC job schedulers.
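To give a concrete idea of what a SLURM backend looks like, here is a sketch of a Cromwell backend configuration file, adapted from the general shape documented by Cromwell. The partition name, time limit, CPU count, and memory values are placeholders that you must tune for your cluster, and the jar path in the trailing comment is an assumption.

```shell
# Write a sketch of a Cromwell SLURM backend configuration.
# Queue name, runtime limits, and memory defaults below are placeholders.
cat > slurm.backend.conf <<'EOF'
include required(classpath("application"))

backend {
  default = SLURM
  providers {
    SLURM {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 2
        Int memory_mb = 8000
        String queue = "normal"
        """
        submit = """
        sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} \
          -t ${runtime_minutes} -p ${queue} -c ${cpus} --mem=${memory_mb} \
          --wrap "/bin/bash ${script}"
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
EOF

# Start Cromwell against this backend (jar path is an example):
# java -Dconfig.file=slurm.backend.conf -jar cromwell-38.jar server
echo "wrote slurm.backend.conf"
```

The `submit`, `kill`, and `check-alive` strings are the only scheduler-specific pieces, which is how Cromwell supports other HPC schedulers through the same config mechanism.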
When running GATK on an HPC cluster, all prerequisites listed below should be installed on the Application or Head Node. Cromwell should also be installed on this Application Node. WDLs, JSONs, datasets, and tools (GATK and non-GATK) should all be installed on a shared file system, or a drive accessible by all Compute Nodes. An overview of what gets installed where can be found here.
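One way to organize the shared file system is a single top-level directory with subdirectories per artifact type. The mount point is an assumption (substitute whatever path all Compute Nodes can see, e.g. an NFS or Lustre mount); the sketch below demonstrates the layout under the current directory.

```shell
# Hypothetical layout for the shared area visible to all Compute Nodes.
# In production this would live on a shared mount such as /shared/gatk;
# demonstrated here under the current directory.
SHARED=./shared-demo
mkdir -p "$SHARED/jars" "$SHARED/wdl" "$SHARED/json" "$SHARED/data" "$SHARED/tools"
ls "$SHARED"
```

Keeping WDLs, JSONs, datasets, and tool jars under one shared root makes it easy to reference them with absolute paths in your inputs files.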
Here is a list of prerequisites you should install:
- Git: A version control system, used here to fetch and track open-source tools.
- Java: Required to run Cromwell.
- sbt: Required to compile Cromwell.
- MySQL: Used by Cromwell for persistent storage.
- Docker: Used by Cromwell to run the introductory tutorial; can also be used to run full workflows.
- MariaDB: An alternative to MySQL that uses the same commands and works well with Fedora; used for call caching (the ability to bypass running tasks that have previously succeeded).
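Once you believe everything is installed, a quick sanity check on the Application Node is to confirm that each prerequisite is actually on the PATH. The following sketch only checks for the commands; anything reported MISSING still needs to be installed through your distribution's package manager.

```shell
# Check that each prerequisite command is on the PATH of the Application
# Node; write a small report so the result can be reviewed later.
for tool in git java sbt mysql docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool"
  else
    echo "MISSING: $tool"
  fi
done | tee prereq_report.txt
```

This does not verify versions; check the Cromwell release notes for the Java and sbt versions it expects.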
Install GATK and non-GATK tools
You’ve got your environment set up and you’re ready to rock.
The GATK Quick Start Guide takes you through the simple steps for installing GATK and testing that it works.
There are non-GATK tools that you will need to install as well; see the links below to help you get started:
Install and configure Cromwell
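Part of configuring Cromwell is pointing it at MySQL (or MariaDB) so that workflow metadata and call-caching state persist across restarts. Below is a sketch of the database stanza, following the shape documented by Cromwell; the user, password, and database name are placeholders you must create in MySQL first, and the jar path in the trailing comment is an assumption.

```shell
# Write a sketch of a Cromwell database configuration for MySQL.
# User, password, and database name are placeholders.
cat > cromwell.db.conf <<'EOF'
include required(classpath("application"))

database {
  profile = "slick.jdbc.MySQLProfile$"
  db {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost/cromwell?rewriteBatchedStatements=true&useSSL=false"
    user = "cromwell"
    password = "changeme"
    connectionTimeout = 5000
  }
}
EOF

# Start Cromwell with this configuration (jar path is an example):
# java -Dconfig.file=cromwell.db.conf -jar cromwell-38.jar server
echo "wrote cromwell.db.conf"
```

Without a persistent database, Cromwell falls back to an in-memory store and call caching does not survive a restart.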
Run a sample workflow
Write and run a simple WDL script as a smoke test
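A minimal smoke-test WDL only needs one task that echoes a string. The sketch below writes such a workflow to a file; the Cromwell jar path in the trailing comment mirrors the example used later in this guide and should be adjusted to your install.

```shell
# Write a minimal "hello world" WDL to use as a smoke test.
cat > hello_world.wdl <<'EOF'
version 1.0

workflow HelloWorld {
  call WriteGreeting
}

task WriteGreeting {
  command {
    echo "Hello World"
  }
  output {
    File greeting = stdout()
  }
}
EOF

# Run it through Cromwell (adjust the jar path to your install):
# java -jar /gatk/my_data/jars/cromwell-38.jar run hello_world.wdl
echo "wrote hello_world.wdl"
```

If this workflow succeeds, Cromwell, Java, and your backend configuration are all working, and any later failures are likely specific to GATK or your data paths.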
Run a sanity check using HaplotypeCaller
The hello_gatk.wdl workflow runs HaplotypeCaller in GVCF mode. It analyzes alignment data in BAM format and produces GVCF variant calls. The sample data to use for this test run can be found here.
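For orientation, the command below sketches what the workflow does under the hood as a direct GATK4 invocation of HaplotypeCaller in GVCF mode. The invocation pattern (`-R`, `-I`, `-O`, `-ERC GVCF`) is standard GATK4 usage, but the file names and paths here are hypothetical; the sketch writes the command to a script rather than running it.

```shell
# Save the direct-invocation equivalent of hello_gatk.wdl as a script.
# Reference, input BAM, and output paths are hypothetical placeholders.
cat > run_haplotypecaller.sh <<'EOF'
#!/bin/sh
gatk HaplotypeCaller \
  -R /gatk/my_data/ref/ref.fasta \
  -I /gatk/my_data/hello_gatk/input.bam \
  -O /gatk/my_data/hello_gatk/output.g.vcf.gz \
  -ERC GVCF
EOF
chmod +x run_haplotypecaller.sh
echo "wrote run_haplotypecaller.sh"
```

Running it via the WDL instead lets Cromwell handle job submission, logging, and call caching for you.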
Open up the inputs file, hello_gatk.inputs.json, to check that it is filled out already. You may need to adapt the paths depending on where you put the data files. Absolute file paths are preferred.
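As an illustration of the shape of such a file, the sketch below writes an example inputs JSON and validates it. The key names here are hypothetical (open the provided hello_gatk.inputs.json to see the actual keys); the point is the `WorkflowName.input_name` key pattern and the use of absolute paths.

```shell
# Write an example inputs JSON. The key names and file paths are
# hypothetical placeholders, not the real hello_gatk.inputs.json keys.
cat > hello_gatk.inputs.example.json <<'EOF'
{
  "HelloGatk.input_bam": "/gatk/my_data/hello_gatk/sample.bam",
  "HelloGatk.ref_fasta": "/gatk/my_data/ref/ref.fasta"
}
EOF

# Confirm the file is well-formed JSON before handing it to Cromwell:
python3 -m json.tool hello_gatk.inputs.example.json > /dev/null && echo "valid JSON"
```

Validating the JSON up front catches stray commas or quotes before Cromwell reports a less direct parsing error.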
Run hello_gatk.wdl. Note that you have to change the path from hello_world in the previous steps to hello_gatk.
java -jar /gatk/my_data/jars/cromwell-38.jar \
  run /gatk/my_data/hello_gatk/hello_gatk.wdl \
  -i /gatk/my_data/hello_gatk/hello_gatk.inputs.json
Access GATK WDLs for running on-premises
At this point you should be ready to run your own workflows on your local HPC. You can download the official and up-to-date WDL scripts here. The scripts have been validated with Cromwell v36.