In a nutshell, Spark is a piece of software that GATK4 uses to do multithreading, which is a form of parallelization that allows a computer (or cluster of computers) to finish executing a task sooner. You can read more about multithreading and parallelism in GATK here. The Spark software library is open-source and maintained by the Apache Software Foundation. It is very widely used in the computing industry and is one of the most promising technologies for accelerating execution of analysis pipelines.
Not all GATK tools use Spark
Tools that can use Spark generally have a note to that effect in their respective Tool Doc.
Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions
The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.Some GATK tools only exist in a Spark-capable version
Those tools don't have the "Spark" suffix.
You don't need a Spark cluster to run Spark-enabled GATK tools!
If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.
To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Index for tool-specific recommendations.
If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.
Example command-line parameters
Here are some example arguments you would give to a Spark-enabled GATK tool:
--spark-master local[*]
-> "Run on the local machine using all cores"--spark-master local[2]
-> "Run on the local machine using two cores"--spark-master spark://23.195.26.187:7077
-> "Run on the cluster at 23.195.26.187, port 7077"--spark-runner GCS --cluster my_cluster
-> "Run on my_cluster in Google Dataproc"
You don't need to install any additional software to use Spark in GATK
All the necessary software for using Spark, whether it's on a local machine or a Spark cluster, is bundled within the GATK itself. Just make sure to invoke GATK using the gatk wrapper script rather than calling the jar directly, because the wrapper will select the appropriate jar file (there are two!) and will set some parameters for you.
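For example, a Spark-enabled tool is invoked through the wrapper just like any other GATK tool; here is a sketch using PrintReadsSpark, with placeholder file names and an illustrative Java memory setting:
gatk --java-options "-Xmx4G" PrintReadsSpark -I input.bam -O output.bam --spark-master local[2]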
You don't need to use --spark-runner to locally run ApplyBQSR
If you want to, for example, specify num-executors, executor-cores, or executor-memory locally with ApplyBQSR, you may find yourself typing out --spark-runner local ApplyBQSR. However, this will not actually do anything, and is equivalent to running the tool without the argument.
If you run ApplyBQSR in this manner, you will find that Spark-specific arguments (like --num-executors) won't work; because they are Spark-specific, GATK won't recognize them as valid arguments for the non-Spark tool.
If you want to run ApplyBQSR locally using Spark multi-threading, use ApplyBQSRSpark instead, as in the example below:
gatk ApplyBQSRSpark --input input.bam --output output.bam --bqsr-recal-file output.baserecalibrationtable.txt
Here, you can specify the number of threads to use by adding the argument --spark-master local[$NUM_THREADS]. (If not specified, the tool will use as many threads as there are available cores.)
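For example, to limit the tool to four local threads (reusing the placeholder file names from the command above):
gatk ApplyBQSRSpark --input input.bam --output output.bam --bqsr-recal-file output.baserecalibrationtable.txt --spark-master local[4]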
If you are running on a Spark cluster using spark-submit (--spark-runner SPARK) or on Google Cloud Dataproc (--spark-runner GCS), then --num-executors and the other Spark arguments can be specified after the GATK tool arguments, separated from them by --.
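As a rough sketch (the master address, executor settings, and file paths below are placeholders, not a tested recipe), a cluster submission might look like this:
gatk ApplyBQSRSpark \
    --input hdfs://my-cluster/input.bam \
    --output hdfs://my-cluster/output.bam \
    --bqsr-recal-file output.baserecalibrationtable.txt \
    -- \
    --spark-runner SPARK \
    --spark-master spark://23.195.26.187:7077 \
    --num-executors 5 \
    --executor-cores 2 \
    --executor-memory 4g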
5 comments
What options need to be specified when Spark is running over YARN in cluster mode? From what I understand, there is no Spark daemon process listening for jobs; one simply submits jobs using the spark-submit binary with the master set to yarn, and Spark picks up the YARN config from $HADOOP_CONF_DIR. Is that mode not supported?
Hi
It looks like the link below, listed on this page, is broken:
See the example parameters below and the local-Spark tutorial for more information
"Some GATK tools only exist in a Spark-capable version
Those tools don't have the "Spark" suffix."
Does that mean tools like CombineGVCFs can run with the parameter --conf 'spark.executor.cores=8'?
Is it possible to run spark workflows on multiple nodes? If yes, how?
Hi,
This link about spark is broken:
https://gatk.broadinstitute.org/hc/en-us/articles/360035889831