Broad-Intel Genomics Stack (BIGstack) is an end-to-end, optimized solution on Intel hardware for analyzing genomic data. It provides an efficient way to run pre-packaged, optimized workflows, including the GATK Best Practices workflows.
BIGstack’s software stack includes two components developed by Intel for efficient and scalable execution of genomics workflows: GenomicsDB and the Genomics Kernel Library (GKL). GenomicsDB is a data store for genomic variants. It is based on the TileDB array storage manager, a system for efficiently storing, querying, and accessing sparse and dense matrix/array data. GKL is a collection of common, compute-intensive kernels used in genomic analysis tools. Intel and The Broad Institute worked together to identify these kernels in GATK, and experts across Intel optimized the kernels for Intel architecture.
BIGstack also supports running other open-source genomic analysis tools: Picard, BWA, and Samtools. These tools perform a wide variety of tasks, from sorting and fixing tags to generating recalibration models. Using Workflow Description Language (WDL) files, users specify the files to analyze, the tools to use, and the order in which the execution engine (Cromwell) performs the tasks.
Guide to running GATK on a local HPC
This guide points you to the documentation you need to get GATK up and running in your local high-performance computing (HPC) environment.
Intel can provide guidance on hardware reference designs for running secondary analysis. Some basic guidance can be found here. For more information, please have your system administrator contact their Intel representative.
If you want to run GATK on your own system, you’ll need to get acquainted with WDL, a community-driven, user-friendly scripting language, and Cromwell, an open-source workflow execution engine that can connect to a variety of platforms through pluggable backends.
In this guide, you will find instructions and links for installing all prerequisites and tools necessary for running GATK on a system of your choosing:
- Before you start: Assumptions and prerequisites
- Install prerequisites
- Install GATK and non-GATK tools
- Install and configure Cromwell
- Run a sample workflow
- Access GATK WDLs for running on-premises
Before you start: Assumptions and prerequisites
The GATK documentation explains software dependencies for running GATK.
We assume that a job scheduler is already installed. Cromwell supports running GATK on different infrastructure backends, including HPC job schedulers. SLURM is recommended; please see the Cromwell docs for a full list of supported HPC job schedulers.
When running GATK on an HPC cluster, all prerequisites listed below should be installed on the Application or Head Node. Cromwell should also be installed on this Application Node. WDLs, JSONs, datasets, and tools (GATK and non-GATK) should all be installed on a shared file system, or a drive accessible by all Compute Nodes. An overview of what gets installed where can be found here.
Here is a list of prerequisites you should install:
- Git: A version control system for tracking changes to open-sourced services.
- Java: Required to run Cromwell.
- sbt: Required to compile Cromwell.
- MySQL: Used by Cromwell for persistent storage.
- Docker: Used by Cromwell to run the introductory tutorial; can also be used to run full workflows.
- MariaDB: An alternative to MySQL that uses the same commands and works well on Fedora; used for call caching (the ability to skip tasks that have previously succeeded).
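Before moving on, it can help to confirm which of these tools are already available on the Application Node. The following is a minimal sketch, not part of the original guide. The command names are assumptions: in particular, `mysql` is assumed to be the client binary for both MySQL and MariaDB, which may not hold for every installation.

```shell
# Report which prerequisite commands cannot be found on PATH.
# The command names passed in below are assumptions; adjust them to match
# how the tools are installed on your system.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

missing=$(missing_tools git java sbt mysql docker)
if [ -z "$missing" ]; then
  echo "All prerequisites found."
else
  echo "Not found on PATH:"
  printf '%s\n' "$missing"
fi
```

The script only checks that each command resolves. It does not check versions or whether a database server or the Docker daemon is actually running.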
Install GATK and non-GATK tools
You’ve got your environment set up and you are ready to rock.
The GATK Quick Start Guide takes you through the simple steps for installing GATK and testing that it works.
You will also need to install several non-GATK tools. See the links below to help you get started:
Install and configure Cromwell
Run a sample workflow
Write and run a simple WDL script as a smoke test
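As an illustration of what such a smoke test might look like, the sketch below writes a tiny workflow and a matching inputs file. The workflow, task, and input names are hypothetical, as is the `WORKDIR` location, and the WDL uses draft-2 syntax (no `version` line). Treat it as a starting point and not as the tutorial's actual script.

```shell
# Write a minimal WDL workflow and its inputs file.
# WORKDIR is a hypothetical location; it falls back to a temporary directory.
WORKDIR="${WORKDIR:-$(mktemp -d)}/hello_world"
mkdir -p "$WORKDIR"

# The quoted 'EOF' stops the shell from expanding ${name} inside the WDL.
cat > "$WORKDIR/hello_world.wdl" <<'EOF'
task hello {
  String name
  command {
    echo "Hello, ${name}!"
  }
  output {
    String greeting = read_string(stdout())
  }
}

workflow hello_world {
  call hello
}
EOF

# Inputs are keyed as <workflow>.<task>.<input>.
cat > "$WORKDIR/hello_world.inputs.json" <<'EOF'
{
  "hello_world.hello.name": "GATK"
}
EOF

echo "Wrote $WORKDIR/hello_world.wdl and $WORKDIR/hello_world.inputs.json"
# To run it with Cromwell (adjust the jar path to your installation):
#   java -jar /gatk/my_data/jars/cromwell-38.jar run "$WORKDIR/hello_world.wdl" \
#     -i "$WORKDIR/hello_world.inputs.json"
```

The script only creates the two files. Running the workflow is left as a separate, commented step so that the files can be inspected first.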
Run a sanity check using HaplotypeCaller
The hello_gatk.wdl workflow runs HaplotypeCaller in GVCF mode. It analyzes alignment data in BAM format and produces GVCF variant calls. The sample data to use for this test run can be found here.
Open up the inputs file, hello_gatk.inputs.json, to check that it is filled out already. You may need to adapt the paths depending on where you put the data files. Absolute file paths are preferred.
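One way to catch a wrong path before launching the workflow is to scan the inputs file for absolute paths and flag any that do not exist. The helper below is a rough sketch under stated assumptions. The function name is hypothetical, and the scan is plain text matching, not real JSON parsing: it assumes each path is a double-quoted string that starts with `/` and contains no escaped quotes.

```shell
# List absolute paths mentioned in an inputs JSON that do not exist on disk.
# This is a text scan, not a JSON parser (see the assumptions above).
missing_input_paths() {
  grep -o '"/[^"]*"' "$1" | tr -d '"' | while IFS= read -r path; do
    [ -e "$path" ] || printf '%s\n' "$path"
  done
}

# Usage, from the directory that holds the inputs file:
#   missing_input_paths hello_gatk.inputs.json
```

Empty output means every absolute path found in the file exists. Run the check from a node that mounts the shared file system, since the Compute Nodes need to see the same paths.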
Run hello_gatk.wdl. Note that you have to change the path from hello_world in the previous steps to hello_gatk.
java -jar /gatk/my_data/jars/cromwell-38.jar \
  run /gatk/my_data/hello_gatk/hello_gatk.wdl \
  -i /gatk/my_data/hello_gatk/hello_gatk.inputs.json
Access GATK WDLs for running on-premises
At this point you should be ready to run your own workflows on your local HPC. You can download the official and up-to-date WDL scripts here. The scripts have been validated with Cromwell v36.