The gatk-workflows GitHub organization houses a set of repositories containing workflows contributed by the Broad Institute, along with versions of these workflows optimized by Intel to take advantage of the latest technologies, such as FPGAs, to improve runtime and performance. The available workflows cover several types of genomic analysis using GATK’s Best Practices, such as Data Preprocessing for Variant Discovery, Somatic Sequence Analysis using Mutect, and simpler workflows used for sequence format conversion.
Each provided workflow has an accompanying JSON file containing references, resources, default parameters, and input BAM files used to test the workflow on the user's platform. The document below will guide users through executing an example workflow on the Google Cloud Platform as well as running the workflow locally.
Please note that Broad is moving towards a cloud-centric computing environment, thus the provided workflows are designed and intended to work on the cloud. Some of these workflows may need to be modified by the user before executing on a local environment.
Key Google Cloud Buckets
Running Workflows Using Google Cloud Platform
General Prerequisites:
- Google Account linked to a Billing project
Tool Prerequisites:
Instructions:
Set up your working directory. Make a directory to test workflows, then change into it.
mkdir gatk-workflows
cd gatk-workflows
Download the latest release of Cromwell, the Java executable (jar) that will run the WDL.
wget https://github.com/broadinstitute/cromwell/releases/download/33.1/cromwell-33.1.jar
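Before moving on, it can be worth confirming that Java and the jar are in working order. A minimal check, assuming Java 8 is installed (older Cromwell releases such as 33 were built for Java 8):

# Confirm the Java version on your PATH:
java -version
# Print Cromwell's command-line usage to confirm the jar downloaded intact:
java -jar cromwell-33.1.jar --help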
Clone the repository you would like to execute. In this example we will be executing validate-bam from the seq-format-validation repository.
git clone https://github.com/gatk-workflows/seq-format-validation.git
Once you’ve successfully cloned the repository, a seq-format-validation directory will appear in your gatk-workflows working directory. The seq-format-validation directory contains multiple files, but we are only concerned with the WDL and its JSON. We’ll be running the validate-bam.wdl workflow using its accompanying JSON file, validate-bam.inputs.json. The JSON contains the required and optional parameters needed to run the workflow, including the path to a test input file located in a Google Cloud bucket.
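If you'd like to see which parameters the workflow expects before going further, you can simply print the JSON; the path below assumes you are still in the gatk-workflows directory:

# View the inputs JSON that ships with the repository:
cat seq-format-validation/validate-bam.inputs.json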
We have our WDL and we have a JSON file, but we need one more file to run on Google Cloud: a configuration file that tells Cromwell we would like to execute our workflow on the cloud. You can create your own configuration using the instructions found in the Cromwell documentation. In this example we'll name our conf file google-adc.conf and copy the contents below into it.
Create and edit conf file
vim google-adc.conf
Copy the contents below into the file
include required(classpath("application"))

google {
  application-name = "cromwell"
  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
  }
}

backend {
  default = "JES"
  providers {
    JES {
      actor-factory = "cromwell.backend.impl.jes.JesBackendLifecycleActorFactory"
      config {
        // Google project
        project = "<your-project-name>"

        compute-service-account = "default"

        // Base bucket for workflow executions
        root = "gs://<your-bucket>/cromwell-execution"

        // Polling for completion backs-off gradually for slower-running jobs.
        // This is the maximum polling interval (in seconds):
        maximum-polling-interval = 600

        // Optional Dockerhub Credentials. Can be used to access private docker images.
        dockerhub {
          // account = ""
          // token = ""
        }

        genomics {
          // A reference to an auth defined in the `google` stanza at the top. This auth is used to create
          // Pipelines and manipulate auth JSONs.
          auth = "application-default"
          // Endpoint for APIs, no reason to change this unless directed by Google.
          endpoint-url = "https://genomics.googleapis.com/"
        }

        filesystems {
          gcs {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "application-default"
          }
        }
      }
    }
  }
}

system {
  input-read-limits {
    lines = 1280000
    bool = 7
    int = 19
    float = 50
    string = 1280000
    json = 1280000
    tsv = 1280000
    map = 1280000
    object = 1280000
  }
}
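The configuration above relies on Application Default Credentials. If you have not set them up yet, something along these lines with the Google Cloud SDK typically does it; the project name here is a placeholder for your own billing project:

# Authenticate and create Application Default Credentials for Cromwell to use:
gcloud auth login
gcloud auth application-default login
# Point gcloud at the billing project you intend to charge (placeholder name):
gcloud config set project your-project-name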
At this point your directory structure should look like this:
|-gatk-workflows/
  |-cromwell-33.1.jar
  |-google-adc.conf
  |-seq-format-validation/
    |-LICENSE
    |-README.md
    |-generic.google-papi.options.json
    |-validate-bam.inputs.json
    |-validate-bam.wdl
Before you execute the workflow you'll need two pieces of information: 1) the project that will pay for the run, and 2) where to store your output files.
Your project name can be determined by entering gcloud info in your terminal. The project name will be listed under "Current Properties".
Current Properties:
  [core]
    project: [your-project-name]
    account: [your-account@gmail.com]
    disable_usage_reporting: [True]
  [compute]
    region: [us-central1]
    zone: [us-central1-a]
The location of your bucket is completely up to you. It can be one you create or one that was designated to you by the project owner. An example would be gs://my-bucket/
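If you need to create a bucket, or just want to confirm you can reach an existing one, gsutil can do both; gs://my-bucket/ is the same placeholder name used above:

# Create a bucket in your project (bucket names must be globally unique):
gsutil mb gs://my-bucket/
# Confirm you can list its contents:
gsutil ls gs://my-bucket/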
It's time to execute the workflow.
java -Dconfig.file=google-adc.conf \
  -Dbackend.providers.JES.config.project=<your-project-name> \
  -Dbackend.providers.JES.config.root=gs://<your-bucket>/ \
  -jar cromwell-33.1.jar run ./seq-format-validation/validate-bam.wdl \
  --inputs ./seq-format-validation/validate-bam.inputs.json
While the workflow is running, Cromwell will print logs to your screen (lots of them). Once it completes, it will print a message indicating the run was successful, along with the Google bucket location of the output files generated by your workflow.
You can copy the output files to a local directory using gsutil cp. For example:
gsutil cp gs://my-bucket/path/to/output /path/to/local/directory/
Running Workflows Locally
Tool Prerequisites:
Instructions:
Set up your working directory. Make a working directory to test workflows, then change into it.
mkdir gatk-workflows
cd gatk-workflows
Make a directory to store input files.
mkdir inputs
Download the latest release of Cromwell, the Java executable (jar) that will run the WDL.
wget https://github.com/broadinstitute/cromwell/releases/download/33.1/cromwell-33.1.jar
Clone the repository you would like to execute. In this example we will be executing validate-bam from the seq-format-validation repository.
git clone https://github.com/gatk-workflows/seq-format-validation.git
Once you’ve successfully cloned the repository, a seq-format-validation directory will be in your gatk-workflows working directory. The seq-format-validation directory contains multiple files, but we are only concerned with the WDL and its JSON. We’ll be running the validate-bam.wdl workflow using its accompanying JSON file, validate-bam.inputs.json. The JSON contains the required and optional parameters needed to run the workflow, including the path to the input file located in a Google Cloud bucket. Since we're running this locally, we’ll first need to download any files mentioned in the JSON. In this case we only need to download the input files, but the same instructions can be used for reference/resource files. Special note: because this is a local demo and the medium BAM file is 18 GB, we’ll only download and work with the small BAM file.
The input files listed in the JSON file are located at the following Google Cloud bucket paths:
"gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam",
"gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_med.hg38.bam"
The base Google bucket name is gs://gatk-test-data; a web link to this Google bucket is provided in this document under the subtitle Key Google Cloud Buckets. Use the web link to open the Google bucket of interest in your web browser, then use the file path provided in the JSON (e.g. /wgs_bam/NA12878_24RG_hg38/NA12878_24RG_med.hg38.bam) to locate the files in the bucket. The files can be downloaded by clicking on their file names.
Once the file is downloaded, be sure to move it into the gatk-workflows/inputs/ directory.
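As an alternative to downloading through the browser, the same file can be pulled directly from the bucket with gsutil, assuming the Google Cloud SDK is installed and you are authenticated; run this from the gatk-workflows directory:

# Download the small test BAM straight into the inputs directory:
gsutil cp gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam inputs/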
Next we’ll edit the JSON file so that all gs:// file paths are replaced with local file paths.
Change directories to the cloned repository
cd seq-format-validation
Edit the JSON file to replace the input file path.
vim validate-bam.inputs.json
After replacing the input file paths you should have something like this:
{
  "##Comment1": "Input",
  "ValidateBamsWf.bam_array": ["/home/username/gatk-workflows/inputs/NA12878_24RG_small.hg38.bam"],

  "##Comment2": "Parameter",
  "ValidateBamsWf.ValidateBAM.validation_mode": "SUMMARY",

  "##Comment3": "Runtime - uncomment the lines below and supply a valid docker container to override the default",
  "ValidateBamsWf.ValidateBAM.mem_size": "1 GB",
  "ValidateBamsWf.ValidateBAM.disk_size": 100,
  "##ValidateBamsWf.ValidateBAM.gatk_path_override": "String (optional)",
  "##ValidateBamsWf.gatk_docker_override": "String (optional)"
}
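A stray character or missing quote in this file is a common cause of "Workflow input processing failed" errors, so it can help to validate the edited JSON before running; for example, if Python is installed:

# Pretty-print the file; any syntax error is reported with a line and column number:
python -m json.tool validate-bam.inputs.json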
Change back to the main working directory
cd ../
At this point your directory structure should look like this:
|-gatk-workflows/
  |-cromwell-33.1.jar
  |-inputs/
    |-NA12878_24RG_small.hg38.bam
  |-seq-format-validation/
    |-LICENSE
    |-README.md
    |-generic.google-papi.options.json
    |-validate-bam.inputs.json
    |-validate-bam.wdl
It's time to execute the workflow.
java -jar cromwell-33.1.jar run ./seq-format-validation/validate-bam.wdl --inputs ./seq-format-validation/validate-bam.inputs.json
While the workflow is running, Cromwell will print logs to your screen (lots of them). Once it completes, it will print a message indicating the run was successful, along with the location of the output files generated by your workflow.
Side note: after the workflow completes, you’ll see two directories, cromwell-executions and cromwell-workflow-logs. cromwell-workflow-logs will have a log file for each job you execute, while cromwell-executions will contain the outputs generated by your executed jobs. The cromwell-executions directory will have a subdirectory for each workflow you’ve run. We’ve only run ValidateBamsWf (the title of the workflow found in the workflow block of the WDL script), so that will be the only folder you will see.
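If you want to locate the output files on disk, a rough sketch of the usual layout is below; the run ID directory and task names will differ depending on your run and the WDL's task definitions:

# List the workflows that have been run so far:
ls cromwell-executions/
# Find the per-task execution directories, which hold each task's outputs:
find cromwell-executions/ValidateBamsWf -type d -name execution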
Important Notes
- It is the user’s responsibility to alter the JSON to meet their needs; example JSON files should not be used in production without being customized and vetted by principal scientists.
5 comments
I tried following the instructions to run locally but seem to run into problems - I get error messages when executing the workflow. I think it is because I failed to edit the json file correctly. If I just replace the path to the bam file and then attempt to execute, the error message says
[2021-09-22 13:49:26,73] [error] WorkflowManagerActor Workflow 82ae9bf6-b62a-43b6-87ca-c874e73a0715 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
Unexpected character '#' at input index 4 (line 2, position 3), expected '"':
##Comment1:Input,
I also tried removing all the comment lines, but then I get an error that says that V is an unexpected character. I imagine it is because I have absolutely no idea what I'm doing with .json file editing, but just wanted to check if these error messages would be expected? Where in the .json file does the valid docker container have to go?
Please consider updating the version of Cromwell that you advise downloading; currently the latest release is 84.
Naomi Dyer
Have you solved the error below?
WorkflowManagerActor: Workflow bba60044-9b2c-425b-aade-77ca38e6691a failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
Unexpected character '#' at input index 4 (line 2, position 3), expected '"':
##Comment1:Input,
After updating Cromwell to the latest version, I still get the error!!!
Yes, please update this guide to reference the latest version of Cromwell. Running the version currently linked in this article (v33.1) led to an error when running processing-for-variant-discovery-gatk4.wdl. Updating to v84 fixed this.
Don't use the JES backend, the genomics API, Java 8, or Cromwell 33. All of those are old.
To save headaches, use this documentation:
https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/#lets-get-started
If you can get a "Hello World" going, you're probably fine. Cromwell versions 60 and higher support Java 11 (Java 17 is even working for me). I'm also using PAPIv2 as a backend and the Life Sciences API as an endpoint url; making these changes finally allowed me to run my workflow.