This document provides important context information about how the GATK Best Practices are developed and what are their limitations.
- What are the GATK Best Practices?
- Analysis phases
- Experimental designs
- Workflow scripts provided as reference implementations
- Scope and limitations
- What is not GATK Best Practices?
- Beware legacy scripts
1. What are the GATK Best Practices?
Reads-to-variants workflows used at the Broad Institute.
The GATK Best Practices provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed. The Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses.
Wherever we can, we try to provide guidance regarding experimental design, quality control (QC) and pipeline implementation options, but please understand that those are dependent on many factors including sequencing technology and the hardware infrastructure that are at your disposal, so you may need to adapt our recommendations to your specific situation.
2. Analysis phases
Although the Best Practices workflows are each tailored to a particular application (type of variation and experimental design), overall they follow similar patterns, typically comprising two or three analysis phases depending on the application.
(1) Data pre-processing is the first phase in all cases, and involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
(2) Variant discovery proceeds from analysis-ready BAM files and produces variant calls. This involves identifying genomic variation in one or more individuals and applying filtering methods appropriate to the experimental design. The output is typically in VCF format although some classes of variants (such as CNVs) are difficult to represent in VCF and may therefore be represented in other structured text-based formats.
(3) Additional steps such as filtering and annotation may be required to produce a callset ready for downstream genetic analysis, depending on the application. This typically involves using resources of known variation, truthsets and other metadata to assess and improve the accuracy of the results as well as attach additional information.
3. Experimental designs
Whole genomes. Exomes. Gene panels. RNAseq.
These are the major experimental designs we support explicitly. Some of our workflows are specific to only one experimental design, while others can be adapted to others with some modifications. This is indicated in the workflow documentation where applicable. Note that any workflow tagged as applicable to whole genome sequence (WGS) and others is presented by default in the form that is suitable for whole genomes, and must be modified to apply to the others as recommended in the workflow documentation. Exomes, gene panels and other targeted sequencing experiments generally share the same workflow for a given variant type with only minor modifications.
4. Workflow scripts provided as reference implementations
Less guesswork, more reproducibility.
It's one thing to know what steps should be run (which is what the Best Practices tell you) and quite another to set up a pipeline that does it in practice. To help you cross this important gap, we provide the scripts that we use in our own pipelines as reference implementations. The scripts are written in WDL, a workflow description language designed specifically to be readable and writable by humans without an advanced programming background. WDL scripts can be run on Cromwell, an open-source execution engine that can connect to a variety of different platforms, whether local or cloud-based, through pluggable backends. See the Pipelining Options article for more on the Cromwell + WDL pipelining solution.
Note that some of the production scripts we provide are specifically optimized to run on the Google Cloud Platform. Wherever possible we also provide "generic" versions that are not platform-specific.
5. Scope and limitations
We can't test for every possible use case or technology.
We develop and validate these workflows in collaboration with many investigators within the Broad Institute's network of affiliated institutions. They are deployed at scale in the Broad's production pipelines -- a very large scale indeed. So as a general rule, the command-line arguments and parameters given in the documentation are meant to be broadly applicable (so to speak). However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data, organisms or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values.
In addition, several key steps make use of external resources, including validated databases of known variants. If there are few or no such resources available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution. On the bright side, if you solve them for your field, you will be a hero to a generation of researchers and your citation index will go through the roof.
6. What is not GATK Best Practices?
Lots of workflows that people call GATK Best Practices diverge significantly from our recommendations.
Not that they're necessarily bad. Sometimes it makes perfect sense to diverge from our standard Best Practices in order to address a problem or use case that they're not designed to handle. The canonical Best Practices workflows (as run in production at the Broad) are designed specifically for human genome research and are optimized for the instrumentation (overwhelmingly Illumina) and needs of the Broad Institute sequencing facility.
They can be adapted for analysis of non-human organisms of all kinds, including non-diploids, and of different data types, with varying degrees of effort depending on how divergent the use case and data type are. However, any workflow that has been significantly adapted or customized, whether for performance reasons or to fit a use case that we do not explicitly cover, should not be called "GATK Best Practices", which is a term that carries specific meaning. The correct way to refer to such workflows is "based on" or "adapted from" GATK Best Practices.
When in doubt about whether a particular customization constitutes a significant divergence, feel free to ask us in the forum.
7. Beware legacy scripts
Trust, but verify.
If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what are the key steps that it includes.
Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single parameter in the commands. If anything is unfamiliar to you, you should find out what it does. If you can't find it in the documentation, ask us in the forum.
We understand, it will take a few hours of your life to verify the code. It's one or two hours of your life that can save you days of troubleshooting on the tail end of the pipeline, so please protect yourself by being thorough.