GATK can be deployed on high performance computing (HPC) systems using an HPC batch scheduler. Intel provides a fully integrated solution to help users set up and run GATK workflows on HPC. This solution, or reference architecture, includes a hardware bill of materials, a recommended software stack, and step-by-step install guides. See this link for an overview of the solution.
In addition to this solution, or reference architecture, Intel has partnered with the Broad Institute to integrate a collection of common, compute-intensive kernels used in genomic analysis tools into GATK. The Genomics Kernel Library (GKL) includes AVX-512 optimizations as well as compression and decompression libraries, and is distributed as open source with GATK.
The Intel reference architecture supports other open source genomics libraries, including Picard, BWA, and Samtools. These tools perform a wide variety of tasks, from sorting and fixing tags to generating recalibration models. Users specify the files to be analyzed, the tools they want to use, and the order in which the execution engine (Cromwell) performs the tasks, using Workflow Description Language (WDL) files.
Users with existing HPC systems can follow the instructions below to install and run GATK "bare metal," i.e., without containers or VMs.
Guide to running GATK on a local HPC
We provide here a resource to guide you around the relevant documentation you need to get GATK up and running on your local high-performance computing (HPC) environment.
Intel can provide guidance on hardware reference designs for running secondary analysis. Some basic guidance can be found here. For more information, please have your system administrator contact their Intel representative.
If you want to run GATK on your own system, you’ll need to get acquainted with WDL, a community-driven user-friendly scripting language, and Cromwell, an open-source workflow execution engine that can connect to a variety of different platforms through pluggable backends.
In this guide, you will find instructions and links for installing all prerequisites and tools necessary for running GATK on a system of your choosing:
- Before you start: Assumptions and prerequisites
- Install prerequisites
- Install GATK and non-GATK tools
- Install and configure Cromwell
- Run a sample workflow
- Access GATK WDLs for running on-premises
Before you start: Assumptions and prerequisites
The GATK documentation explains software dependencies for running GATK.
We assume that a job scheduler is already installed. Cromwell supports running GATK on different infrastructure backends, including HPC job schedulers. SLURM is recommended; please see the Cromwell docs for a full list of supported HPC job schedulers.
When running GATK on an HPC cluster, all prerequisites listed below should be installed on the Application or Head Node. Cromwell should also be installed on this Application Node. WDLs, JSONs, datasets, and tools (GATK and non-GATK) should all be installed on a shared file system, or a drive accessible by all Compute Nodes. An overview of what gets installed where can be found here.
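As a concrete sketch of that layout, a shared file system visible to all Compute Nodes might be organized like this (the mount point and directory names are illustrative only):

```
/shared/genomics/            # shared mount, visible to all Compute Nodes
├── tools/                   # GATK jar plus Picard, BWA, Samtools
├── wdl/                     # workflow scripts
├── json/                    # inputs files
└── data/                    # reference genome and sample datasets
```

Cromwell itself and its prerequisites (Java, sbt, MySQL) live on the Application Node, which submits jobs to the scheduler on behalf of the workflow.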
Here is a list of prerequisites you should install:
- Git: A version control system; used here to clone the open-source tools and repositories below.
- Java: Required to run Cromwell.
- sbt: Required to compile Cromwell.
- MySQL: Used by Cromwell for persistent storage.
- Docker: Used by Cromwell to run the introductory tutorial; can also be used to run full workflows.
- MariaDB: An alternative to MySQL that uses the same commands and plays well with Fedora. Either database enables call caching (the ability to bypass rerunning tasks that have previously succeeded).
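Before moving on, it can save time to confirm that each prerequisite is actually on the PATH of the Application Node. A minimal check script, assuming the standard command names (MariaDB ships the same `mysql` client command):

```shell
# Report which prerequisites are installed and which are missing.
for cmd in git java sbt mysql docker; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found ($("$cmd" --version 2>&1 | head -n 1))"
  else
    echo "$cmd: MISSING"
  fi
done
```

Anything reported as MISSING should be installed before proceeding to the Cromwell setup below.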
Install GATK and non-GATK tools
You’ve got your environment set up and you’re ready to rock.
The GATK Quick Start Guide takes you through the simple steps for installing GATK and testing that it works.
There are non-GATK tools that you will need to install as well; see the links below to help you get started:
Install and configure Cromwell
See tutorials for installing and configuring Cromwell.
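To point Cromwell at an HPC scheduler, you add a backend stanza to its configuration file. A sketch of a SLURM backend for `cromwell.conf` is shown below; the runtime attributes and `sbatch` options are examples, so check the Cromwell backend documentation for the full set of options supported by your version:

```hocon
# Example SLURM backend stanza for cromwell.conf (values are illustrative)
backend {
  default = slurm
  providers {
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
          Int cpus = 2
          String runtime_minutes = "00:30:00"
        """
        submit = """
          sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} \
            -t ${runtime_minutes} -c ${cpus} \
            --wrap "/bin/bash ${script}"
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
```

With this in place, each WDL task becomes one `sbatch` submission, and Cromwell polls `squeue` to track job status.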
Run a sample workflow
Write and run a simple WDL script as a smoke test
To test if you have successfully installed the toolkit, follow these instructions to run a simple workflow. You can also download a simple “Hello World” WDL script here.
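For reference, the "Hello World" smoke test from the Cromwell five-minute introduction is a single-task workflow along these lines:

```wdl
workflow myWorkflow {
    call myTask
}

task myTask {
    command {
        echo "hello world"
    }
    output {
        String out = read_string(stdout())
    }
}
```

If Cromwell can run this to completion and report the `out` value, the execution engine itself is working, and any later failures are more likely to be tool or path issues.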
Run a sanity check using HaplotypeCaller
The hello_gatk.wdl workflow runs HaplotypeCaller in GVCF mode. The workflow analyzes alignment data in BAM format and produces GVCF variant calls. The sample data to use for this test run can be found here.
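Conceptually, the HaplotypeCaller step boils down to a WDL task like the sketch below. The task and variable names here are illustrative, not necessarily those used in hello_gatk.wdl, but the GATK flags (`-R`, `-I`, `-O`, `-ERC GVCF`) are the standard HaplotypeCaller options:

```wdl
task HaplotypeCallerGvcf {
    File input_bam
    File ref_fasta      # the .fai index and .dict must sit alongside the FASTA
    String output_name

    command {
        gatk HaplotypeCaller \
            -R ${ref_fasta} \
            -I ${input_bam} \
            -O ${output_name}.g.vcf.gz \
            -ERC GVCF
    }
    output {
        File output_gvcf = "${output_name}.g.vcf.gz"
    }
}
```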
Open up the inputs file, hello_gatk.inputs.json, to check that it is filled out already. You may need to adapt the paths depending on where you put the data files. Absolute file paths are preferred.
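An inputs file is just a JSON object mapping fully qualified workflow inputs (WorkflowName.inputName) to values. A hypothetical example is shown below; the actual keys and paths in hello_gatk.inputs.json will differ:

```json
{
  "HelloGatk.input_bam": "/shared/data/sample.bam",
  "HelloGatk.ref_fasta": "/shared/data/reference.fasta"
}
```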
Run hello_gatk.wdl. Note that you have to change the paths from hello_world in the previous steps to hello_gatk.
java -jar /gatk/my_data/jars/cromwell-38.jar \
  run /gatk/my_data/hello_gatk/hello_gatk.wdl \
  -i /gatk/my_data/hello_gatk/hello_gatk.inputs.json
Access GATK WDLs for running on-prem
At this point you should be ready to run your own workflows on your local HPC. You can download the official and up-to-date WDL scripts here. The scripts have been validated with Cromwell v36.
For testing the installation we need an inputs .json file - but where is the 'hello_gatk.inputs.json' file?
I cannot find it in the sample directory or in any of the associated links.
I have been using the GATK Docker image with Docker Desktop on a Windows workstation and it works fine. As a bit of a displacement activity while I am dodging delta, I am thinking about building a small HPC -- a Baby Beowulf cluster -- with a controller and 3 worker nodes using gen n-1 'experienced' servers. There seem to be many available from a number of second-hand dealers.
1) The Lustre documentation says it is good for > 100 nodes, which suggests that it is overkill for a Baby Beowulf. As the nearly 600 pages of documents show, Lustre has all sorts of things needed for running large HPCs, but its administration may be little more than a hassle for a small system?
2) Each node will have fast CPUs and lots of memory sticks, but what is the minimal amount of disk storage on each node? OS + SLURM + Docker images + genomic references + N (resident input and output files). This suggests that a flash drive or NVMe might be all that is needed on each node? The files being analyzed will be passing through, not stored, so a huge amount of space is not needed on the little cluster.