Call caching not functional when running Cromwell locally on a GCP VM
Hi all,
I am running WARP's single-sample exome analysis pipeline locally on Ubuntu 20.04 in a GCP VM with 8 cores and 64 GB RAM. The tasks employ gatk:4.1.8.0. The input is a 16 GB WES file named S1340Nr1A.unmapped.bam. The exact command used:
nohup sudo time java -Dconfig.file=/home/arvadosgcp/local_cromwell_config.conf -jar cromwell-54.jar run warp/pipelines/broad/dna_seq/germline/single_sample/exome/ExomeGermlineSingleSample_deneme.wdl --inputs S1340Nr1A.json &
The content of my local_cromwell_config.conf file:
# This is an example of how you can use the LocalExample backend to define
# a new backend provider. *This is not a complete configuration file!* The
# content here should be copy pasted into the backend -> providers section
# of the cromwell.examples.conf in the root of the repository. You should
# uncomment lines that you want to define, and read carefully to customize
# the file. If you have any questions, please open an issue at
# https://www.github.com/broadinstitute/cromwell/issues

# Documentation
# https://cromwell.readthedocs.io/en/stable/backends/Local/

# Define a new backend provider.
LocalExample {
  # The actor that runs the backend. In this case, it's the Shared File System (SFS) ConfigBackend.
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"

  # The backend custom configuration.
  config {
    # Optional limits on the number of concurrent jobs
    #concurrent-job-limit = 5

    # If true, submits scripts to the bash background using "&". Only useful for dispatchers that do NOT submit
    # the job and then immediately return a scheduled job id.
    run-in-background = true

    # `temporary-directory` creates the temporary directory for commands.
    #
    # If this value is not set explicitly, the default value creates a unique temporary directory, equivalent to:
    # temporary-directory = "$(mktemp -d \"$PWD\"/tmp.XXXXXX)"
    #
    # The expression is run from the execution directory for the script. The expression must create the directory
    # if it does not exist, and then return the full path to the directory.
    #
    # To create and return a non-random temporary directory, use something like:
    # temporary-directory = "$(mkdir -p /tmp/mydir && echo /tmp/mydir)"

    # `script-epilogue` configures a shell command to run after the execution of every command block.
    #
    # If this value is not set explicitly, the default value is `sync`, equivalent to:
    # script-epilogue = "sync"
    #
    # To turn off the default `sync` behavior set this value to an empty string:
    # script-epilogue = ""

    # `glob-link-command` specifies the command used to link glob outputs; by default it uses hard links.
    # If the filesystem doesn't allow hard links (e.g., BeeGFS), change to soft links as follows:
    # glob-link-command = "ln -sL GLOB_PATTERN GLOB_DIRECTORY"

    # The list of possible runtime custom attributes.
    runtime-attributes = """
    String? docker
    String? docker_user
    """

    # Submit string when there is no "docker" runtime attribute.
    submit = "/usr/bin/env bash ${script}"

    # Submit string when there is a "docker" runtime attribute.
    submit-docker = """
    docker run \
      --rm -i \
      ${"--user " + docker_user} \
      --entrypoint ${job_shell} \
      -v ${cwd}:${docker_cwd} \
      ${docker} ${script}
    """

    # Root directory where Cromwell writes job results. This directory must be
    # visible and writeable by the Cromwell process as well as the jobs that Cromwell
    # launches.
    root = "cromwell-executions"

    # Root directory where Cromwell writes job results in the container. This value
    # can be used to specify where the execution folder is mounted in the container.
    # It is used for the construction of the docker_cwd string in the submit-docker
    # value above.
    dockerRoot = "/cromwell-executions"

    # File system configuration.
    filesystems {
      # For SFS backends, the "local" configuration specifies how files are handled.
      local {
        # Try to hard link (ln), then soft-link (ln -s), and if both fail, then copy the files.
        localization: [
          "hard-link", "soft-link", "copy"
        ]

        # Call caching strategies
        caching {
          # When copying a cached result, what type of file duplication should occur.
          # For more information check: https://cromwell.readthedocs.io/en/stable/backends/HPC/#shared-filesystem
          duplication-strategy: [
            "soft-link", "hard-link", "copy"
          ]

          # Strategy to determine if a file has been used before.
          # For an extended explanation and alternative strategies check: https://cromwell.readthedocs.io/en/stable/Configuring/#call-caching
          hashing-strategy: "md5"

          # When true, will check if a sibling file with the same name and the .md5 extension exists, and if it does, use the content of this file as a hash.
          # If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
          check-sibling-md5: false
        }
      }
    }

    # The defaults for runtime attributes if not provided.
    default-runtime-attributes {
      failOnStderr: false
      continueOnReturnCode: 0
    }
  }
}

# Optional call-caching configuration.
call-caching {
  # Allows re-use of existing results for jobs you've already run
  # (default: false)
  enabled = true

  # Whether to invalidate a cache result forever if we cannot reuse it. Disable this if you expect some cache copies
  # to fail for external reasons which should not invalidate the cache (e.g. auth differences between users):
  # (default: true)
  #invalidate-bad-cache-results = true

  # The maximum number of times Cromwell will attempt to copy cache hits before giving up and running the job.
  #max-failed-copy-attempts = 1000000

  # blacklist-cache {
  #   # The call caching blacklist cache is off by default. This cache is used to blacklist cache hits based on cache
  #   # hit ids or buckets of cache hit paths that Cromwell has previously failed to copy for permissions reasons.
  #   enabled: true
  #
  #   # A blacklist grouping can be specified in workflow options which will inform the blacklister which workflows
  #   # should share a blacklist cache.
  #   groupings {
  #     workflow-option: call-cache-blacklist-group
  #     concurrency: 10000
  #     ttl: 2 hours
  #     size: 1000
  #   }
  #
  #   buckets {
  #     # Guava cache concurrency.
  #     concurrency: 10000
  #     # How long entries in the cache should live from the time of their last access.
  #     ttl: 20 minutes
  #     # Maximum number of entries in the cache.
  #     size: 1000
  #   }
  #
  #   hits {
  #     # Guava cache concurrency.
  #     concurrency: 10000
  #     # How long entries in the cache should live from the time of their last access.
  #     ttl: 20 minutes
  #     # Maximum number of entries in the cache.
  #     size: 100000
  #   }
  #
  # }
}
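For reference, this is the overall skeleton I believe the file above has to sit in. The backend wrapper follows the comment at the top of the example file; the database block is my own reconstruction from the Cromwell configuration docs (unverified), based on my reading that run mode keeps its call-cache data in an in-memory database unless a persistent one is configured:

# Skeleton only -- the LocalExample block pasted above goes where marked.
include required(classpath("application"))

backend {
  default = LocalExample
  providers {
    LocalExample {
      # ... provider block from above ...
    }
  }
}

call-caching {
  enabled = true
}

# Unverified assumption on my part: a file-backed HSQLDB so that call-cache
# entries survive between separate `cromwell run` invocations; adapted from
# the file-based database example in the Cromwell docs.
database {
  profile = "slick.jdbc.HsqldbProfile$"
  db {
    driver = "org.hsqldb.jdbcDriver"
    url = "jdbc:hsqldb:file:cromwell-db/cromwell-db;shutdown=false;hsqldb.tx=mvcc"
    connectionTimeout = 120000
  }
}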
Essentially, I just enabled the call-caching option and re-ordered the duplication strategy to prioritize soft links. Incorporating this config file into my Cromwell runs does not seem to change any behavior with the same input and command.
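One more detail in case it helps with debugging: I have been checking whether call caching was even attempted by writing out the run metadata and inspecting the callCaching fields. The --metadata-output flag and these field names are my reading of the Cromwell docs, so treat this as a sketch:

# Ask run mode to dump workflow metadata next to the run (paths are placeholders).
java -Dconfig.file=local_cromwell_config.conf -jar cromwell-54.jar run \
    pipeline.wdl --inputs inputs.json --metadata-output metadata.json

# List the caching mode and verdict per call attempt (requires jq).
# My understanding: effectiveCallCachingMode should read "ReadAndWriteCache"
# when the config was picked up; "CallCachingOff" would mean the
# call-caching stanza was ignored.
jq '.calls | to_entries[] | .key as $call | .value[]
    | {call: $call,
       mode: .callCaching.effectiveCallCachingMode,
       result: .callCaching.result}' metadata.json

Any ideas? Thanks a lot!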
-
Hi, our GATK support team is focused on questions involving GATK issues or abnormal results. You can see our support policy here. For questions regarding Cromwell, we encourage other users to help each other find solutions. You can also look at these resources for more information:
- Bioinformatics Stack Exchange
- Cromwell Slack organization: cromwellhq.slack.com
- Cromwell Documentation
-
Hi Genevieve,
I have a similar issue with call caching not working. I've looked over the documentation and also could not find any relevant results on Bioinformatics Stack Exchange. How can I join the Cromwell Slack organization?
Thank you,
Morgan
-
I believe you can join the workspace at https://slack.com/. Please let me know if it does not work!
Best,
Genevieve
-
Hi Genevieve Brandt (she/her),
I tried accessing the link cromwellhq.slack.com, but it says I don't have an account on the workspace. How could I get an invite to the channel?
Thanks,
Morgan
-
I see, I'll look into it!
-
Great thanks!
-
Morgan Worthington, here is a temporary link to join the workspace; it will expire after some time: https://join.slack.com/t/cromwellhq/shared_invite/zt-dxmmrtye-JHxwKE53rfKE_ZWdOHIB4g
-
It worked! Thanks again!
-
Great!