The GATK-SV pipeline was built and is meant to be used on the Terra platform. This article is written to supplement the many great resources also available on the Terra support site. If there are problems you are running into that cannot be solved by following the advice here or on the Terra site, please write to us on the GATK forum! Your feedback helps to make our resources better.
Many things can go wrong when running a complicated pipeline like GATK-SV. First, we give examples of common error messages. Second, we link to a video with step-by-step instructions for changing structs to control resource allocation, a common solution to many errors. Third, we cover how to prevent unnecessary cloud spend when running GATK-SV. Fourth, we share some tips from our GATK-SV developers.
Troubleshooting Error Messages
The first step in troubleshooting the GATK-SV pipeline is to look for the error message indicating the culprit of your crash. If you are new to troubleshooting in Terra, please check out these resources:
Troubleshooting in the Job Manager
The first step is to take a look at the job manager, which is the top level where Terra reports error messages. For failed workflows, Terra will pull out an error message that seems to be the main cause of the crashed workflow. Unfortunately, not all of these error messages directly describe the issue you are facing, so often, it is important to dig deeper in the log files to determine a more specific cause.
PAPI 9 - If you see a PAPI error code 9, this means your task failed in the VM. There is an article on the Terra support site describing the various reasons why you might see the PAPI error code 9. With this error message, you will need to take a closer look at the relevant log file to get to the bottom of what might be wrong.
PAPI 10 - A PAPI error code 10 is an abrupt crash of the VM. This is a non-specific error but we do usually find the underlying issue to be an out of memory problem. Take a look at this article on the Terra support site regarding PAPI error code 10. Check the relevant log files to look for any clues indicating this issue is a memory or disk issue. If there are no clues in the logs, try increasing the memory, or the memory and disk.
The compute backend terminated the job - This error message is similar to PAPI error code 10. The VM crashed abruptly and we don’t have a lot of information about why it happened. Check the error and log files to get more clues about why the VM crashed. If anything in the log files indicates that the job ran out of memory, find the relevant task and increase its memory following the instructions in our troubleshooting video, linked here. Since increasing the disk is low cost, you can also increase the disk size to make sure the task completes.
Job exit code 137 - A job exit code 137 indicates that the task ran out of memory. Follow the instructions in our tutorial video to find which task ran out of memory and increase the “mem_gb” input for that task.
Out of memory - This error states "stderr for job Whamg.RunWhamgIncludelist:NA:2 contained one of the memory-retry-error-keys: [OutOfMemory,Killed] specified in the Cromwell config. Job might have run out of memory." This means Terra found text in the log files suggesting that the job ran out of memory. Terra looked through the log files so that you don’t have to! Follow the instructions in our tutorial video to find which task ran out of memory and increase the “mem_gb” input for that task.
Helpful error messages buried in the logs
If you were not able to figure out the problem from the top level job manager, you will need to look through the log files to find more information about what went wrong while the workflow was running. We have an example of looking through the log files in our troubleshooting GATK-SV video. The error messages described below will help you determine what to change before you re-run the workflow. Please reach out to us on the GATK forum if you cannot figure out what to do from the given error message.
The logs or stderr mentions “Killed”
This is the same issue as Exit code 137, just presenting differently! This indicates that the task ran out of memory. You’ll need to find which task ran out of memory and increase the “mem_gb” runtime attribute for the task. Detailed instructions for this process can be found in our troubleshooting GATK-SV video.
No space left on device
If your logs or stderr mention that there is “no space left on device”, then your VM ran out of disk space for one of your tasks. Generally, you will need to increase the “disk_gb” runtime attribute for the task.
Error messages indicating that you need to double check your inputs
Could not build the path \"\". It may refer to a file system not supported by this instance of Cromwell. Supported file systems are: Google Cloud Storage. Failures: Google Cloud Storage: Path \"\" does not have a gcs scheme (IllegalArgumentException). Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
A “File” type in the WDL script has been assigned to an empty string.
Could not read from gs://example.bam: File example.bam is larger than 10000000 Bytes.
You supplied a data file as input, but the WDL expects a text file listing the URIs.
Required file output '/cromwell_root/output.txt' does not exist.
Something went wrong in a previous command and an expected output file was never created. You may be able to figure out what step went wrong further up in the log file.
BadRequestException: 400 Bucket is requester pays bucket but no user project provided.
GatherSampleEvidence: If your CRAM file is in a requester pays bucket, make sure to set the “requester_pays_cram” column to “true” for that sample in the sample data table.
GatherBatchEvidence: If your GVCF is in a requester pays bucket, you must enter the Terra project for the workspace as one of the workflow arguments. Enter the project in the gvcf_gcs_project_for_requester_pays argument as a string, surrounded by double-quotes. You can find the project ID associated with your Terra workspace on the top right side of the dashboard under Google Project ID.
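Several of the input problems above can be caught before launching a workflow. The following is a minimal sketch, not part of GATK-SV, that scans an inputs dictionary (for example, one loaded from a downloaded inputs JSON) for empty strings. The input names in the example are hypothetical. The check for stray whitespace is our own assumption about a common formatting slip and is not an error described above. Because the JSON does not say which inputs are “File” types, the sketch flags every empty string, including ones that may be harmless.

```python
def find_suspect_inputs(inputs):
    """Return (input name, reason) pairs for values worth double-checking."""
    problems = []
    for name, value in inputs.items():
        # Treat a single value and a list of values the same way.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not isinstance(item, str):
                continue  # numbers, booleans and nulls are not checked here
            if item == "":
                problems.append((name, "empty string"))
            elif item != item.strip():
                problems.append((name, "leading or trailing whitespace"))
    return problems


# Hypothetical input names, for illustration only.
example_inputs = {
    "MyWorkflow.reads_file": "",
    "MyWorkflow.sample_id": "sample_1",
    "MyWorkflow.reference": " gs://my-bucket/ref.fasta",
    "MyWorkflow.threads": 4,
}

for name, reason in find_suspect_inputs(example_inputs):
    print(f"{name}: {reason}")
```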
If you get the following error messages, try re-running the workflow.
503 Service Unavailable Backend Error
IOException: Could not read from gs://exec-bucket/MyWorkflow/workflow/Task/rc: 504 Gateway Timeout GET https://storage.googleapis.com/download/storage/v1/b/exec-bucket%2MyWorkflow%2workflow%2Task%2rc?alt=media
Permissions in Terra
pet-XXXXXXXXXXXXXXXXX@terra-XXXXXXX.iam.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage bucket
It’s common to get error messages regarding data not being properly shared with your Terra account. If you are working with Terra data, share the workspace with your Terra account to fix the issue. With non-Terra GCP data, you’ll need to share the data with your Terra proxy group. What’s a proxy group? Terra has a full overview in the article Pet service accounts and proxy groups, but for a quick summary - Terra uses additional Google accounts to interact with cloud resources outside of Terra so that your user ID is not shared. The Pet service accounts and proxy groups article gives a good explanation of how to fix this issue. If you are still running into issues, reach out to the Terra support team and they can help with troubleshooting the problem.
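As a quick reference, the error messages in this section can be summarized as a lookup from message text to the suggested next step. The sketch below is only an illustration of that mapping. The patterns are taken from the wording quoted in this article, so the exact text in your own logs may differ, and the rule list and function name are our own invention.

```python
import re

# (pattern, suggestion) pairs based on the messages quoted in this article.
RULES = [
    (r"no space left on device", "increase disk_gb for the failing task"),
    (r"exit code 137|\bKilled\b|OutOfMemory", "increase mem_gb for the failing task"),
    (r"requester pays", "set the requester pays input for the workflow"),
    (r"does not have storage\.objects\.get access",
     "share the data with your Terra proxy group"),
    (r"503 Service Unavailable|504 Gateway Timeout", "re-run the workflow"),
    (r"Could not build the path", "look for a File input set to an empty string"),
    (r"is larger than \d+ Bytes", "pass a text file listing URIs, not the data file"),
    (r"Required file output .* does not exist",
     "look further up the log for the command that failed"),
]


def suggest(log_text):
    """Return the suggestions whose pattern appears in log_text, in rule order."""
    return [
        suggestion
        for pattern, suggestion in RULES
        if re.search(pattern, log_text, flags=re.IGNORECASE)
    ]


print(suggest("BadRequestException: 400 Bucket is requester pays bucket "
              "but no user project provided."))
```

An empty list means none of the patterns matched, in which case the log files still need to be read by hand.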
How to change resource allocation
Now that you know the problem, this linked video demonstrates how to find the argument that needs to be changed and how to change it.
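For readers who edit a downloaded inputs JSON rather than the Terra interface, the change can be sketched in a few lines of Python. This is an illustration only. It assumes the runtime struct is written as a JSON object holding the “mem_gb” and “disk_gb” fields mentioned above, and the input name MyWorkflow.runtime_attr_my_task is a placeholder, not a real GATK-SV input. Use the video to find the actual argument for the task that failed.

```python
import json


def bump_runtime_attr(inputs, attr_input, mem_gb=None, disk_gb=None):
    """Return a copy of inputs with mem_gb and/or disk_gb set on one runtime struct."""
    updated = dict(inputs)
    # Copy the existing struct (or start an empty one) so the original is untouched.
    struct = dict(updated.get(attr_input) or {})
    if mem_gb is not None:
        struct["mem_gb"] = mem_gb
    if disk_gb is not None:
        struct["disk_gb"] = disk_gb
    updated[attr_input] = struct
    return updated


# Placeholder input name and values, for illustration only.
inputs = {"MyWorkflow.runtime_attr_my_task": {"disk_gb": 50}}
updated = bump_runtime_attr(
    inputs, "MyWorkflow.runtime_attr_my_task", mem_gb=16, disk_gb=100
)
print(json.dumps(updated, indent=2))
```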
Preventing Cloud Spend Issues
As we all know, preventing major cloud spend is really important when running large analyses like the GATK-SV cohort pipeline. We have a few big recommendations to minimize these costs and prevent big headaches!
- Turn on the option to delete intermediate files! There’s a Terra article explaining this option and the details for when you should and shouldn’t use it. We recommend it for most use cases because the GATK-SV cohort pipeline creates many temporary output files that will take an enormous amount of time in the future to delete if you don’t delete them as you go.
- Once a workflow completes successfully, manually delete the files left behind by its earlier failed attempts. This matters most for GatherSampleEvidence, because intermediate BAM files are very large.
- Perform preliminary sample QC and sample exclusion prior to starting GATK-SV.
- We recommend removing outlier samples from analysis. Outlier samples can cause issues during variant clustering and filtering steps which can lead to more expenses and false positive calls in other samples. However, we also understand that with clinical studies, it’s not always possible to remove samples. The GATK-SV pipeline is generally able to handle outliers and as you continue your analysis you can look at the QC results to determine if the outliers are causing issues.
- Run each workflow on one sample or batch first to confirm that it completes successfully before launching all jobs.
Tips from our Developers
Since GATK-SV is a complicated pipeline, we find that small issues in your setup of the pipeline can lead to error messages that are difficult to diagnose. It can appear that you have a memory or disk issue, when really the problem is formatting.
- Use the pre-configured workflows and data tables rather than manually supplying inputs.
- Adhere to the sample ID and batch name requirements. You can read about these specifications in the GATK-SV README.
- Check the formatting of your PED file and make sure it follows the specifications in this Pedigree format description.
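A quick script can catch some PED formatting problems before launching a workflow. The sketch below is based on the standard six-column PED layout (family ID, sample ID, father ID, mother ID, sex, phenotype) and is not the GATK-SV validator. It assumes tab-separated fields and sex codes of 0, 1, or 2. Both are our assumptions, so the linked Pedigree format description takes precedence wherever it differs.

```python
def check_ped_lines(lines):
    """Return a list of formatting problems found in PED-style lines."""
    problems = []
    seen_samples = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            problems.append(
                f"line {number}: expected 6 tab-separated fields, found {len(fields)}"
            )
            continue
        family, sample, father, mother, sex, phenotype = fields
        if sex not in {"0", "1", "2"}:  # assumed set of allowed sex codes
            problems.append(f"line {number}: unexpected sex code {sex!r}")
        if sample in seen_samples:
            problems.append(f"line {number}: duplicate sample ID {sample!r}")
        seen_samples.add(sample)
    return problems


# Made-up records: a duplicate ID, a space-separated line, and a bad sex code.
example_lines = [
    "FAM1\tS1\t0\t0\t1\t0",
    "FAM1\tS1\t0\t0\t2\t0",
    "FAM2 S3 0 0 1 0",
    "FAM3\tS4\t0\t0\tM\t0",
]

for problem in check_ped_lines(example_lines):
    print(problem)
```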