To date, we have published three papers on GATK, a preprint on bioRxiv, and a book (citation details below). You're welcome to cite whichever resource best represents the aspect of GATK you used in your work.
That said, please remember that our work is continuous and our Best Practices recommendations evolve: specific command lines, argument values, and even tool choices described in any resource will eventually become obsolete. Please always refer to our Best Practices documentation for the most up-to-date and version-appropriate recommendations.
Van der Auwera & O'Connor (2020). Best reference for GATK
This book is the definitive reference for research with genomics algorithms using the GATK, Docker, WDL, and Terra. We ask that you cite this book for work using GATK.
- Van der Auwera GA & O'Connor BD. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st Edition). O'Reilly Media.
Poplin et al. (2017). Detailed description of HaplotypeCaller; best reference for germline joint calling
This is the fourth paper, technically just a manuscript deposited in bioRxiv -- but it counts! This is a good citation to include in a Materials and Methods section or in a Discussion if you're talking about the joint calling process.
- Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, Shakir K, Thibault J, Chandran S, Whelan C, Lek M, Gabriel S, Daly MJ, Neale B, MacArthur DG, Banks E. (2017). Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 201178. DOI: 10.1101/201178.
Van der Auwera et al. (2013). Hands-on tutorial with step-by-step explanations
This is the third GATK paper, which describes the Best Practices for Variant Discovery (version 2.x). It is intended mainly as a learning resource for first-time users and as a protocol reference. This is a good citation to include in a Materials and Methods section. However, it contains information that is out of date compared to the 2020 book, Genomics in the Cloud: Using Docker, GATK, and WDL in Terra, so we recommend citing the 2020 book rather than the 2013 paper.
- Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinformatics, 43:11.10.1-11.10.33. DOI: 10.1002/0471250953.bi1110s43.
DePristo et al. (2011). First incarnation of the Best Practices workflow
This is the second GATK paper, and describes in more detail some of the key tools commonly used in the GATK for high-throughput sequencing data processing and variant discovery. This paper covers base quality score recalibration, indel realignment, SNP calling with UnifiedGenotyper, variant quality score recalibration and their application to deep whole genome, whole exome, and low-pass multi-sample calling. This is a good citation if you use the GATK for variant discovery.
- DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43:491-498. DOI: 10.1038/ng.806.
Note that the workflow described in this paper corresponds to the version 1.x to 2.x best practices. Some key steps for variant discovery have been significantly modified in later versions (3.x onwards). This paper should not be used as a definitive guide to variant discovery with GATK. For that, please see our online documentation guide.
McKenna et al. (2010). Original description of the GATK framework
This is the first GATK paper, which covers the computational philosophy underlying the GATK and is a good citation for the GATK in general.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20:1297-303. DOI: 10.1101/gr.107524.110.
We sequenced 10 samples on 10 lanes on an Illumina HiSeq 2000, aligned the resulting reads to the hg19 reference genome with BWA (Li & Durbin), applied GATK (McKenna et al., 2010) base quality score recalibration, indel realignment, and duplicate removal, and performed SNP and INDEL discovery and genotyping across all 10 samples simultaneously using standard hard filtering parameters or variant quality score recalibration according to GATK Best Practices recommendations (DePristo et al., 2011; Van der Auwera & O'Connor, 2020).