1. Skills / experience
We aim to make the tools usable by everyone, regardless of background.
The GATK does not have a Graphical User Interface (GUI). You don't open it by clicking on the .jar file; to use it directly you have to type commands into the Console (or Terminal). If this is all new to you, we recommend that you first get some basic command-line training (Software Carpentry is a good place to start) before trying to use the GATK. It's not difficult, but you'll need to learn some jargon and get used to living without a mouse. Trust us, it's a liberating experience.
If you prefer working in a point-and-click environment, consider trying Terra. Terra is a secure, freely accessible cloud-based analysis portal developed at the Broad Institute. It includes preconfigured GATK Best Practices pipelines as well as tools for building your own custom pipelines (with any command line tool you want, not just GATK).
Note that Terra is not a GUI-only solution; it's also possible to interact with it programmatically through an API and still take advantage of all the work we've done to preconfigure GATK pipelines and working examples.
2. Input data
Typical inputs and format requirements are documented here as well as in each tool's respective Tool Doc.
Most GATK4 tools have fairly simple software requirements: a Unix-style OS and Java 1.8. However, a subset of tools has additional R or Python dependencies. These dependencies (as well as the base system requirements) are described further below. If Docker is an option on your infrastructure, we strongly encourage you to use it rather than a custom installation. All released versions of GATK4 are available as prepackaged container images on Docker Hub.
The GATK runs natively on most if not all flavors of UNIX, including macOS, Linux and BSD. It is possible to get it running on some recent versions of Windows, but we don't provide any support or instructions for that. If you need to run on a Windows machine, consider using Docker.
Java 8 / JRE or SDK 1.8
The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java runtime version must be exactly 1.8. To be clear: we do not yet support 1.9, and older versions (1.6 and 1.7) no longer work. You can check what version you have by typing java -version at the command line. This article has some more details about what to do if you don't have the right version. Both the Sun/Oracle Java JDK and OpenJDK versions are fully supported.
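The version check described above can be scripted. The following is a minimal sketch, not an official GATK utility: `java_status` is a hypothetical helper name, and the check assumes the usual banner format in which Java 8 reports itself as `version "1.8.0_..."`.

```shell
# java_status BANNER: print "supported" when the banner reports a 1.8 runtime,
# otherwise print "unsupported". (Hypothetical helper, not part of GATK.)
java_status() {
  case "$1" in
    *'version "1.8.'*) echo "supported" ;;
    *) echo "unsupported" ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes its banner to stderr, hence the 2>&1 redirect.
  banner=$(java -version 2>&1 | grep -m 1 'version' || true)
  echo "Java is $(java_status "$banner") for GATK4: $banner"
else
  echo "java not found on PATH"
fi
```

Matching on the banner text rather than comparing numbers keeps the check simple, since only one version line (1.8) is accepted.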
Some of the GATK tools produce plots using R, so if you want to get the plots you'll need to have R and Rscript installed, as well as these R libraries:
The gatk-launch wrapper script requires Python 2.6 or greater.
Some of the newer tools and workflows require Python 3.6.2 along with a set of additional Python packages. We use the Conda package manager to establish and manage the environment and dependencies required by these tools. The GATK Docker image comes with this environment pre-configured. In order to establish an environment suitable to run these tools outside of the Docker image, we provide a Conda config file, gatkcondaenv.yml. To use this, you must first install Conda, then create the GATK-appropriate environment by running the following command:
conda env create -n gatk -f gatkcondaenv.yml
To activate the environment once it has been created, run the command:
source activate gatk
See the Conda documentation for additional information about using and managing Conda environments.
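Before creating the environment, it can be useful to check whether Conda is installed and whether an environment named gatk already exists. The sketch below only prints a suggestion and changes nothing. `env_listed` is a hypothetical helper, and it assumes that `conda env list` prints each environment name in the first column of its output.

```shell
# env_listed NAME: read `conda env list` output on stdin and succeed
# if NAME appears in the first column. (Hypothetical helper.)
env_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

if ! command -v conda >/dev/null 2>&1; then
  echo "conda not found; install Conda first"
elif conda env list | env_listed gatk; then
  echo "Environment 'gatk' already exists; activate it with: source activate gatk"
else
  echo "Create it with: conda env create -n gatk -f gatkcondaenv.yml"
fi
```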
If you plan to build GATK from source, you will need Git 2.5 or greater, git-lfs 1.1.0 or greater, and Gradle 3.1 or greater. Use the ./gradlew script to build from source; see the GitHub repository README for more details.
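The Git and git-lfs minimums above can be checked with a short script. This is a sketch under stated assumptions: `version_ge`, `first_version` and `check_tool` are hypothetical helper names, and the script assumes the first dotted number printed by `git --version` and `git lfs version` is the version. Gradle is not checked here because the ./gradlew script is used for the build.

```shell
# version_ge A B: succeed when dotted numeric version A is >= B.
version_ge() {
  a=$1 b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}; y=${b%%.*}              # leading field of each version
    [ "$a" = "$x" ] && a= || a=${a#*.}  # drop that field from A
    [ "$b" = "$y" ] && b= || b=${b#*.}  # drop that field from B
    x=${x:-0}; y=${y:-0}                # missing fields count as 0
    [ "$x" -gt "$y" ] && return 0
    [ "$x" -lt "$y" ] && return 1
  done
  return 0
}

# first_version: print the first dotted number found in the text on stdin.
first_version() {
  sed -E 's/[^0-9]*([0-9]+(\.[0-9]+)*).*/\1/' | head -n 1
}

# check_tool LABEL MINIMUM COMMAND...: run COMMAND and compare its version.
check_tool() {
  label=$1 min=$2; shift 2
  if ! out=$("$@" 2>/dev/null); then
    echo "$label: not found"
    return 0
  fi
  found=$(printf '%s\n' "$out" | first_version)
  if version_ge "$found" "$min"; then
    echo "$label $found: OK (need >= $min)"
  else
    echo "$label $found: too old (need >= $min)"
  fi
}

check_tool git 2.5 git --version
check_tool git-lfs 1.1.0 git lfs version
```

The comparison is done field by field so that, for example, 2.39 is correctly treated as newer than 2.5.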
We do not provide guidelines for hardware requirements, as these can vary enormously depending on the type of work you plan to do. However, you may find the following helpful:
Our collaborators at the Intel-Broad Center for Genomic Data Engineering can provide you with recommended hardware configurations based on your planned usage. Let us know in the comment thread if you'd like us to introduce you.
As noted above, we make our own cloud-based analysis portal freely available to everyone. It is built on Google Cloud; using the portal is free of charge, and compute, storage and egress costs are charged directly by Google. The advantage to you of using this portal is that we have already set up preconfigured workspaces for all the GATK Best Practices (including runtime hardware parameters, memory, etc.), and you also have the option of adding your own custom pipelines. This removes most of the typical limitations and guesswork involved in working with local infrastructure, and it also makes it easier to share your results and methods.
In addition, we are working with all the other major commercial cloud vendors to make it easy to run GATK pipelines on their platforms. See the "Pipelining Options" documentation for more details.