Contents
- Skills
- Input data
- Software
- Hardware
1. Skills / experience
We aim to make the tools usable by everyone, regardless of your background.
The GATK does not have a Graphical User Interface (GUI). You don't open it by clicking on the .jar file; to use it directly you have to type commands in the Console (or Terminal). If this is all new to you, we recommend that you first get some basic command-line training (Software Carpentry is a good place to start) before trying to use the GATK. It's not difficult, but you'll need to learn some jargon and get used to living without a mouse. Trust us, it's a liberating experience.
If you prefer working in a point-and-click environment, consider trying Terra. Terra is a secure, freely accessible cloud-based analysis portal developed at the Broad Institute. It includes preconfigured GATK Best Practices pipelines as well as tools for building your own custom pipelines (with any command line tool you want, not just GATK).
Note that Terra is not a GUI-only solution; it's also possible to interact with it programmatically through an API and still take advantage of all the work we've done to preconfigure GATK pipelines and working examples.
2. Input data
Typical inputs and format requirements are documented here as well as in each tool's respective Tool Doc.
3. Software
Most GATK4 tools have fairly simple software requirements: a Unix-style OS and Java 1.8. However, a subset of tools have additional R or Python dependencies. These dependencies (as well as the base system requirements) are described further below. We strongly encourage you to use the Docker container system, if that's an option on your infrastructure, rather than a custom installation. All released versions of GATK4 can be found as prepackaged container images in Dockerhub here.
Operating system
The GATK runs natively on most if not all flavors of UNIX, including MacOSX, Linux and BSD. It is possible to get it running on some recent versions of Windows, but we don't provide support or instructions for that. If you need to run on a Windows machine, consider using Docker.
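As a rough sketch of the Docker route, the snippet below assembles a command that starts an interactive shell in a GATK container with a host directory mounted. The image name broadinstitute/gatk is the one shown on Docker Hub; the latest tag, the GATK_TAG variable, the gatk_docker_cmd helper and the /gatk/my_data mount point are illustrative assumptions, not documented requirements. The command is printed rather than executed so you can inspect it first.

```shell
#!/bin/sh
# Image name as listed on Docker Hub; the tag is a placeholder (assumption).
GATK_IMAGE="broadinstitute/gatk"
GATK_TAG="${GATK_TAG:-latest}"

# Hypothetical helper: print a docker command that mounts a host directory
# into the container and opens an interactive shell.
# The container-side path /gatk/my_data is an illustrative choice.
gatk_docker_cmd() {
    data_dir=$1
    echo "docker run --rm -it -v $data_dir:/gatk/my_data $GATK_IMAGE:$GATK_TAG"
}

# Show the command for the current directory; copy and run it once it looks right.
gatk_docker_cmd "$PWD"
```

Pinning GATK_TAG to a specific released version instead of latest should make runs easier to reproduce.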
Java 8 / JRE or SDK 1.8
The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java runtime version must be exactly 1.8. To be clear: we do not yet support 1.9, and older versions (1.6 and 1.7) no longer work. You can check what version you have by typing java -version at the command line. This article has some more details about what to do if you don't have the right version. Both the Sun/Oracle Java JDK and OpenJDK versions are fully supported.
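The version check can also be scripted. The following is a minimal sketch, not an official GATK utility: the helper names extract_java_version and is_java8 are made up for illustration. It relies on two facts about the java launcher: java -version writes to stderr, and Java 8 reports itself as a 1.8.x version string.

```shell
#!/bin/sh
# Pull the quoted version string out of `java -version` output passed as $1.
extract_java_version() {
    printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only when the version string is a 1.8 release (Java 8).
is_java8() {
    case "$1" in
        1.8|1.8.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check whatever `java` is on the PATH; do nothing if none is installed.
if command -v java >/dev/null 2>&1; then
    # `java -version` prints to stderr, hence the 2>&1 redirect.
    version=$(extract_java_version "$(java -version 2>&1)")
    if is_java8 "$version"; then
        echo "Java $version: matches the 1.8 requirement"
    else
        echo "Java $version: not 1.8, GATK4 may not run"
    fi
fi
```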
R dependencies
Some of the GATK tools produce plots using R, so if you want to get the plots you'll need to have R and Rscript installed, as well as these R libraries: gsalib, ggplot2, reshape, gplots.
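To see which of the R libraries named above are missing, something along these lines may help. It is a sketch under assumptions: the GATK_R_PACKAGES variable and the missing_r_packages helper are hypothetical names, and the check simply asks Rscript whether each package can be found.

```shell
#!/bin/sh
# R packages named in the text above.
GATK_R_PACKAGES="gsalib ggplot2 reshape gplots"

# Print the name of every listed package that Rscript cannot find.
# Prints nothing when all of them are installed.
missing_r_packages() {
    for pkg in $GATK_R_PACKAGES; do
        # requireNamespace() returns FALSE instead of raising an error
        # when a package is absent, so we turn that into a non-zero exit.
        if ! Rscript -e "if (!requireNamespace('$pkg', quietly = TRUE)) quit(status = 1)" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}

if command -v Rscript >/dev/null 2>&1; then
    missing=$(missing_r_packages)
    if [ -z "$missing" ]; then
        echo "All R packages found"
    else
        echo "Missing R packages:" $missing
    fi
else
    echo "Rscript not found on PATH"
fi
```

Any package reported as missing can then be installed from within R in the usual way.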
Python dependencies
The gatk-launch wrapper script requires Python 2.6 or greater.
Some of the newer tools and workflows require Python 3.6.2 along with a set of additional Python packages. We use the Conda package manager to establish and manage the environment and dependencies required by these tools. The GATK Docker image comes with this environment pre-configured. In order to establish an environment suitable to run these tools outside of the Docker image, we provide a Conda config file, gatkcondaenv.yml. To use this, you must first install Conda, then create the GATK-appropriate environment by running the following command:
conda env create -n gatk -f gatkcondaenv.yml
To activate the environment once it has been created, run the command:
source activate gatk
See the Conda documentation for additional information about using and managing Conda environments.
Developers only
If you plan to build GATK from source, you will need Git 2.5 or greater, git-lfs 1.1.0 or greater, and Gradle 3.1 or greater. Use the ./gradlew script to build from source; see the GitHub repository README for more details.
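Before building, it can be useful to confirm that an installed tool meets the minimum versions listed above. The sketch below compares dotted version numbers field by field; version_ge is a hypothetical helper, it assumes purely numeric fields, and only the Git check is shown (the same pattern would apply to git-lfs and Gradle).

```shell
#!/bin/sh
# Succeed when dotted version $1 is greater than or equal to $2.
# Fields are compared numerically from left to right; missing fields count as 0.
version_ge() {
    v1=$1
    v2=$2
    while [ -n "$v1" ] || [ -n "$v2" ]; do
        # Take the first field of each version, then drop it from the remainder.
        f1=${v1%%.*}
        f2=${v2%%.*}
        [ "$v1" = "$f1" ] && v1="" || v1=${v1#*.}
        [ "$v2" = "$f2" ] && v2="" || v2=${v2#*.}
        f1=${f1:-0}
        f2=${f2:-0}
        [ "$f1" -gt "$f2" ] && return 0
        [ "$f1" -lt "$f2" ] && return 1
    done
    return 0
}

# Example: compare the installed Git against the 2.5 minimum.
if command -v git >/dev/null 2>&1; then
    # Keep only the leading numeric part, e.g. "git version 2.39.3 (...)" -> "2.39.3".
    git_version=$(git --version | sed 's/[^0-9.]*\([0-9.]*\).*/\1/')
    if version_ge "$git_version" 2.5; then
        echo "git $git_version meets the 2.5 minimum"
    else
        echo "git $git_version is older than 2.5"
    fi
fi
```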
4. Hardware
We do not provide guidelines for hardware requirements, as these can vary enormously depending on the type of work you plan to do. However, you may find the following helpful:
Local infrastructure
Our collaborators at the Intel-Broad Center for Genomic Data Engineering can provide you with recommended hardware configurations based on your planned usage. Let us know in the comment thread if you'd like us to introduce you.
Cloud options
As noted above, we make our own cloud-based analysis portal freely available to everyone. It is built on Google Cloud; using the portal is free of charge, and compute/storage/egress costs are charged directly by Google. The advantage to you of using this portal is that we have already set up preconfigured workspaces for all the GATK Best Practices (including runtime hardware parameters, memory etc), and you also have the option of adding your own custom pipelines. This removes most of the typical limitations and guesswork involved in working with local infrastructure, and it also makes it easier to share your results and methods.
In addition, we are working with all the other major commercial cloud vendors to make it easy to run GATK pipelines on their platforms. See the "Pipelining Options" documentation for more details.
6 comments
"Our collaborators at the Intel-Broad Center for Genomic Data Engineering can provide you with recommended hardware configurations based on your planned usage. Let us know in the comment thread if you'd like us to introduce you."
Can you please introduce me to "Intel-Broad Center for Genomic Data Engineering"? Need guidance on hardware configuration, appreciate your help.
22 Aug 2023: This may help others: I downloaded version 1.8 from a variety of sources including oracle, openjdk. None worked with GATK. Belatedly found the advice in the downloaded GATK readme which advised downloading
I am planning to buy a laptop dedicated to GATK PathSeq. I am new to GATK (I have some basic R experience) and am not sure whether to buy a PC, a Mac, or a Unix machine. If I buy a PC and install a Unix OS on it, can I run GATK PathSeq? How many processor cores are needed? How much RAM is needed: 32 GB, 64 GB, or more? Is an i7 processor OK? Will a 1 TB SSD be enough? I can go up to 4 TB. What is Docker? I see it here: https://hub.docker.com/r/broadinstitute/gatk/ but have no idea what it is. Can someone let me know if this PC is good enough to start with a Unix OS? https://www.amazon.com/gp/product/B091RR4D7N/ref=ox_sc_saved_title_3?smid=A3MS2WDGGX0NZU&psc=1
Or need something like this? https://www.amazon.com/Apple-MacBook-Laptop-12%E2%80%91core-30%E2%80%91core/dp/B0BSHG76FM/ref=sr_1_3?crid=3HV1QLVDQ1M9C&keywords=macbook&qid=1705737346&refinements=p_n_size_browse-bin%3A2423840011%2Cp_n_feature_thirty-five_browse-bin%3A35913653011%2Cp_n_feature_two_browse-bin%3A5446812011&rnid=562234011&s=pc&sprefix=macbook%2Caps%2C106&sr=1-3
Please let me know. Thanks.
"Our collaborators at the Intel-Broad Center for Genomic Data Engineering can provide you with recommended hardware configurations based on your planned usage. Let us know in the comment thread if you'd like us to introduce you."
Can you please introduce me to "Intel-Broad Center for Genomic Data Engineering"? Need guidance on hardware configuration.
Thank you!
Thank you. Yes, I would prefer to have a recommendation.
My planned use is
GATK, preferably on a PC/Unix platform.
I might run GATK on a large number (thousands) of FASTQ files, but I would prefer not to use cloud or online services.
Running the tasks quickly is therefore my priority (recommendations for RAM, processor cores, and SSD are appreciated).
Any video or other tutorials would be much appreciated.
A recommendation for in-person, hands-on training would also be great!
Thank you.
I would also like an introduction, thank you!