Contents
- Java command basics
- Using the gatk wrapper script (recommended)
- Adding GATK arguments
- Adding Java arguments
- Adding Spark arguments
- Examples of real commands
1. Java command basics
GATK follows the basic Java command-line syntax:
java -jar program.jar [program arguments]
The core of the command is java -jar program.jar, which starts up the program in a Java Virtual Machine (JVM).
2. Using the gatk wrapper script (recommended)
We provide a launch script that encapsulates the java -jar program.jar part of the command in a single invocation, gatk. There are several reasons for this that we don't go into in this article (including that there are now two jars included in the package you download), but the upshot is that it makes it possible to add GATK to your PATH variable, and it allows us to build in some autocomplete functionality for convenience.
So the basic command is now:
gatk [program arguments]
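To call gatk from any directory, the folder that contains the wrapper script has to be on your PATH. The following is a minimal sketch for a POSIX shell, assuming the package was unpacked to $HOME/tools/gatk (a hypothetical location; substitute your own):

```shell
# GATK_HOME is a hypothetical location; point it at the folder
# where you unpacked the GATK package.
GATK_HOME="$HOME/tools/gatk"

# Prepend the folder to PATH so the shell can find the gatk script.
PATH="$GATK_HOME:$PATH"
export PATH

# Check whether the shell can now locate the wrapper script.
if command -v gatk >/dev/null 2>&1; then
    echo "gatk found at: $(command -v gatk)"
else
    echo "gatk not found; check that GATK_HOME points at the unpacked package"
fi
```

Placing the GATK_HOME and PATH lines in your shell startup file (for example ~/.bashrc) should make the change persist across sessions.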
3. Adding GATK arguments
The only universally required argument is the name of the GATK tool you want to run. It is a positional argument, so you specify it directly after the gatk bit, like this:
gatk ToolName [tool arguments]
After the tool name, you can specify any arguments in any order, with the appropriate argument name as follows:
gatk ToolName --argument-name value
Argument naming conventions
The overwhelming majority of argument names follow a "kebab" convention, where the name is prefixed by two dashes (--) and, where applicable, words are separated by single dashes (-). A minority of very commonly used arguments accept a short name prefixed by a single dash (-). The short name is often a single capital letter.
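As an illustration, the two HaplotypeCaller commands assembled below should be equivalent: the first spells out the full kebab-case names and the second uses the short names. The file names are placeholders, and the snippet only prints the commands, so it can be tried without GATK installed.

```shell
# File names are placeholders; substitute your own.
ref=reference.fasta
bam=sample1.bam
out=variants.vcf

# Full names: two leading dashes, words separated by single dashes.
long_form="gatk HaplotypeCaller --reference $ref --input $bam --output $out"

# Short names: one leading dash, usually a capital letter.
short_form="gatk HaplotypeCaller -R $ref -I $bam -O $out"

# Print both for comparison; to run one, paste it into your shell.
printf '%s\n' "$long_form" "$short_form"
```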
Ordering
The ordering of GATK arguments is not important, but we recommend passing required arguments first for consistency. It is also a good idea to consistently order arguments by some kind of logic in order to make it easy to compare different commands over the course of a project. It’s up to you to choose what that logic should be.
Flags
Flags are arguments that have boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --QUIET will suppress some log output. To activate a flag that is set to FALSE by default, all you need to do is add the flag name to the command (no need to specify an actual value). To deactivate a flag that is set to TRUE by default, you need to specify the value as FALSE; for example, --create-output-variant-index FALSE will disable automatic variant indexing.
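If you assemble commands in a script, the two cases can be handled as sketched below. The quiet and make_index variables are hypothetical switches invented for this example, and the file names are placeholders. The snippet prints the resulting command instead of running it.

```shell
# Hypothetical switches for this sketch; set each to "true" or "false".
quiet=true
make_index=false

cmd="gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.vcf"

# --QUIET is FALSE by default: naming the flag is enough to turn it on.
if [ "$quiet" = true ]; then
    cmd="$cmd --QUIET"
fi

# --create-output-variant-index is TRUE by default: turning it off
# requires an explicit FALSE value.
if [ "$make_index" = false ]; then
    cmd="$cmd --create-output-variant-index FALSE"
fi

# Print the assembled command for inspection.
printf '%s\n' "$cmd"
```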
4. Adding Java arguments
Normally you would insert any Java-specific arguments (such as -Xmx, which sets the maximum amount of memory the JVM may use for its heap) between the java and -jar bits of the basic Java command, like this:
java -Xmx4G -jar program.jar [program arguments]
When you're using the gatk wrapper syntax (which we strongly recommend), you have to do it a bit differently, like this:
gatk --java-options "-Xmx4G" [program arguments]
To specify multiple Java arguments, just add them to the quoted string like this:
gatk --java-options "-Xmx4G -XX:+PrintGCDetails" [program arguments]
The order of Java arguments inside the quoted string is not important.
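The quotes do matter, though, especially if you keep the Java arguments in a shell variable. The sketch below uses a throwaway helper, count_args (invented for this demonstration), to show how sh and bash pass the string along with and without quotes:

```shell
# Example JVM settings taken from the commands above; not tuning advice.
JAVA_OPTS="-Xmx4G -XX:+PrintGCDetails"

# Helper for this demonstration only: reports how many arguments it received.
count_args() { echo "$#"; }

# Quoted: the whole string reaches --java-options as a single value.
quoted=$(count_args --java-options "$JAVA_OPTS")

# Unquoted: sh and bash split the string on spaces, so the second JVM
# option would be detached from --java-options.
unquoted=$(count_args --java-options $JAVA_OPTS)

echo "quoted: $quoted arguments, unquoted: $unquoted arguments"
```

With quotes the helper sees two arguments; without them it sees three. In a real command the quoted form would be written as gatk --java-options "$JAVA_OPTS" ToolName [tool arguments].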
5. Adding Spark arguments
When you run Spark-capable tools, you may need to specify Spark-specific parameters. These must be appended to the end of your GATK command, after a -- separator, like this:
gatk [GATK arguments] -- [Spark arguments]
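A concrete command might look like the sketch below. The --spark-runner and --spark-master arguments are given here as I understand them from the GATK README, so treat them as an assumption and confirm them against the documentation for your GATK version. The file names are placeholders. The snippet prints the command instead of running it.

```shell
# Tool arguments go before the bare "--" separator.
gatk_args="MarkDuplicatesSpark -I input.bam -O marked_duplicates.bam"

# Spark arguments go after it (assumed names; verify for your version).
# 'local[2]' is kept in single quotes because square brackets are
# glob characters in the shell.
spark_args="--spark-runner LOCAL --spark-master 'local[2]'"

cmd="gatk $gatk_args -- $spark_args"

# Print the assembled command for inspection.
printf '%s\n' "$cmd"
```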
6. Examples of real commands
This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing variant calls.
gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.vcf
Now let's switch to running HaplotypeCaller in GVCF mode so that we can add multiple samples to our analysis in a scalable way:
gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.g.vcf -ERC GVCF
We can write this same command on multiple lines to make it more readable by using backslashes at the ends of lines:
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF
We can add the common Java memory argument -Xmx like this:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF
If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list
Now let's say we want to add a read filter that deals with some problems in our data:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter
If we want to reduce the amount of chatter in the logs, we can turn on the --QUIET setting like this:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter \
    --QUIET
And finally, if we want to turn off automatic variant index creation:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter \
    --QUIET \
    --create-output-variant-index FALSE
For more examples of commands and for specific tool command recommendations, see the tool index.
3 comments
Dear GATK team,
I'd like to thank you for your dedication and efforts to push bioinformatics forward.
I'm not a professional bioinformatician. I've been exploring the GATK best practice pipeline with my WES data.
I got stuck at MarkDuplicatesSpark, which requires a JDK8 to run but I have JDK11 installed.
https://gatk.broadinstitute.org/hc/en-us/community/posts/360056174592-MarkDuplicatesSpark-crash
I got the exact same error message.
I had a conda environment established and installed a JDK8 there. However, I have to run a ./java -version under the installed folder to see a JDK 1.8. Otherwise, when I type java -version anywhere else, I see version 11.
I wonder how I am supposed to run MarkDuplicatesSpark under these circumstances. I hope I can add a Java option somewhere in the command, but I am not sure.
Hopefully you could also add the solution to the MarkDuplicatesSpark manual as newer versions of java will be gaining popularity.
All the best and thank you.
Field
Hi,
I am using the following command for each autosome and sex chromosome interval in HaplotypeCaller, and I receive an insufficient-memory error. See below for the error. Could anyone suggest a solution? FYI, I am using a large cluster (plenty of RAM, CPUs, ...) and the data are recalibrated CRAM files from WGS.
gatk --java-options "-Xms20G -Xmx20G -XX:ParallelGCThreads=2" HaplotypeCaller \
ERROR Message:
Dear GATK team,
I'd like to ask a question about the parameter settings of HaplotypeCaller. When doing SNP calling, I only want to use some scaffolds of the reference genome (for example, just the autosomes). What parameters should be added?
Thanks and all the best,
Bo Xiao