Fatal error detected by the Java Runtime Environment when running MarkDuplicates
This issue is being filed on behalf of Eduardo Maury.
Description of issue
Eduardo Maury wrote to Terra Support for assistance with running the featured workflow processing-for-variant-discovery-gatk. After a series of troubleshooting steps, Eduardo created their own version of the workflow, because the featured workflow did not include -Xmx flags for the MarkDuplicates and SortAndFixTags tasks. You can see a copy of this WDL script at the bottom of this page under Submitted workflow script.
Eduardo is currently experiencing the following error message:
# A fatal error has been detected by the Java Runtime Environment:
# SIGSEGV (0xb) at pc=0x00007f632e970bf1, pid=17, tid=0x00007ecf1d905700
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x9d1bf1]
# Core dump written. Default location: /cromwell_root/core or core.17
# An error report file with more information is saved as:
# If you would like to submit a bug report, please visit:
We thought this might have been happening because the BAM was unsorted, and MarkDuplicates expects coordinate- or queryname-sorted input. Eduardo updated the WDL so the MergeBamAlignment task uses --SORT_ORDER "queryname" instead of the --SORT_ORDER "unsorted" that the featured workflow uses. The issue persisted despite this change.
The issue also persisted after changing to version 220.127.116.11.
REQUIRED for all errors and issues:
a) GATK version used: 18.104.22.168 (also tried 22.214.171.124)
b) Exact command used:
For MarkDuplicates task
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dsamjdk.compression_level=5 -Xms550G -Xmx590G -XX:+UseSerialGC -jar /gatk/gatk-package-126.96.36.199-local.jar MarkDuplicates --INPUT /cromwell_root/fc-b266996d-0c10-45f7-bd1c-39cb4eef6aa5/submissions/ef865dc6-fce6-46eb-a9ab-2e3c1c155d30/PreProcessingForVariantDiscovery_GATK4/1ec7301d-6836-405a-8d36-44d3bbac2a9d/call-MergeBamAlignment/attempt-4/RP-1044_00485262_v1_WGS_GCP.unmapped.aligned.unsorted.bam --OUTPUT RP-1044_00485262_v1_WGS_GCP.b37.aligned.unsorted.duplicates_marked.bam --METRICS_FILE RP-1044_00485262_v1_WGS_GCP.b37.duplicate_metrics --VALIDATION_STRINGENCY SILENT --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 --ASSUME_SORT_ORDER queryname --CREATE_MD5_FILE true --SORTING_COLLECTION_SIZE_RATIO 0.125
c) Entire program log:
Thank you Jason Cerrato for posting this issue! I'm sorry it has taken so long to get to the bottom of this problem, but I hope we will be able to figure it out soon.
The next step is to determine whether there is anything about this specific BAM file that could be causing these issues. Something like extremely large duplicate sets could be the reason why MarkDuplicates is failing. Eduardo, would it be possible to get more information from you about the BAM file?
- What type of sequencing data is this? Is it amplicon data?
- Have you tried yet to disable optical duplicate detection with READ_NAME_REGEX set to null (in the MarkDuplicates step)?
- Can you check the sort order tag in the header of the input BAM to MarkDuplicates and verify that it says "SO:queryname" (first line of the BAM header)?
- Can you provide an IGV screenshot of a representative section of the BAM, and/or some metrics such as size, total number of reads, maximum depth, etc.?
Thank you for your help looking into this issue further.
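As a pointer for the sort-order check in the third question above: `samtools view -H` prints the header, and the first line (the @HD line) carries the SO: tag. A minimal sketch of the check (the printf here simulates a header line so the snippet is self-contained; in practice you would pipe in `samtools view -H your.bam` instead):

```shell
# Simulated first header line; with a real BAM, replace the printf with:
#   samtools view -H your.bam | head -n 1
hd_line=$(printf '@HD\tVN:1.6\tSO:queryname\n' | head -n 1)

if printf '%s' "$hd_line" | grep -q 'SO:queryname'; then
  echo "sort order is queryname"
else
  echo "WARNING: sort order is not queryname"
fi
```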
This is just WGS, sequenced and QC'ed at the Broad with their standard pipelines and made available in a Terra workspace. Have not tried disabling the optical duplicate marking.
Ok thanks! That is good to know. Can you try disabling the optical duplicate marking and provide some of the details from question #4?
Here is an IGV plot from a random region.
Currently running two samples with the following stats (let me know if you need any other):
Total reads: 2104685646, 2425881222
Mean coverage: 83.141155, 90.941138
Chimera rate: 0.007568, 0.008076
Ran pipeline with READ_NAME_REGEX set to null. Still get errors.
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fb0b0caebf1, pid=17, tid=0x00007fb0adb2e700
#
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x9d1bf1]
#
# Core dump written. Default location: /cromwell_root/core or core.17
#
# An error report file with more information is saved as:
# /cromwell_root/hs_err_pid17.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
Any further recommendations? Is it possible to set up a meeting to discuss?
Hi Eduardo Maury,
I am sorry this is taking so long to diagnose, this is definitely looking like a GATK bug/issue. Unfortunately, we are not able to do meetings for GATK support tickets.
I did get some further feedback from our GATK developers with some options you can try to diagnose where the issue is coming from.
First, could you try disabling the Snappy compressor/decompressor library? It will help narrow down the origin of the issue: if the code runs successfully with Snappy disabled, we'll know the issue lies there, which would be a pretty serious bug. You can do this with the java option -Dsamjdk.snappy.disable=true.
Second, could you try running the job with the JDK deflater/inflater? I can't remember if we have tried this already; if not, I definitely think you should. These are GATK options, -use_jdk_deflater and -use_jdk_inflater.
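For concreteness, here is a sketch of how both diagnostic switches could be spliced into a MarkDuplicates invocation. The jar path and file names are placeholders (not the real workspace paths), and the command is printed rather than executed:

```shell
# The Snappy kill switch goes in the JVM options; the JDK
# inflater/deflater switches are Picard-style GATK arguments.
JVM_OPTS="-Dsamjdk.snappy.disable=true -Xms550G -Xmx590G"
GATK_ARGS="--USE_JDK_INFLATER true --USE_JDK_DEFLATER true"

CMD="java $JVM_OPTS -jar /gatk/gatk-package-local.jar MarkDuplicates \
  --INPUT input.bam --OUTPUT marked.bam $GATK_ARGS"

echo "$CMD"   # dry run: print the assembled command
```

In the actual troubleshooting the two switches were tried separately (Snappy first, then the JDK codecs), so you would add one set at a time.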
Once again, I'm really sorry this issue has caused you such a big headache. I hope we can get to the bottom of what is causing it soon.
currently running without Snappy. For the deflater/inflater, which one should I run, or am I supposed to run both?
Without Snappy I get the following error:
Failed to evaluate 'flowcell_unmapped_bams' (reason 1 of 1): Evaluating read_lines(flowcell_unmapped_bams_list) failed: Failed to read_lines("gs://fc-b266996d-0c10-45f7-bd1c-39cb4eef6aa5/d53b3f2b-5b38-4197-bb13-c44e21069ff1/MergeUnmappedBAMFiles/c4b1a615-4ae9-41a0-a603-7d5fc27e2df8/call-PerformMergeOperation/RP-1044_00485262_v1_WGS_GCP.unmapped.bam") (reason 1 of 1): [Attempted 1 time(s)] - IOException: Could not read from gs://fc-b266996d-0c10-45f7-bd1c-39cb4eef6aa5/d53b3f2b-5b38-4197-bb13-c44e21069ff1/MergeUnmappedBAMFiles/c4b1a615-4ae9-41a0-a603-7d5fc27e2df8/call-PerformMergeOperation/RP-1044_00485262_v1_WGS_GCP.unmapped.bam: File gs://fc-b266996d-0c10-45f7-bd1c-39cb4eef6aa5/d53b3f2b-5b38-4197-bb13-c44e21069ff1/MergeUnmappedBAMFiles/c4b1a615-4ae9-41a0-a603-7d5fc27e2df8/call-PerformMergeOperation/RP-1044_00485262_v1_WGS_GCP.unmapped.bam is larger than requested maximum of 10000000 Bytes.
For the deflater and inflater, run with both options!
I'm passing along your snappy update, thank you for trying that. Do you have the Terra job manager link in case we need to see more?
Eduardo Maury it looks like when you tried Snappy, the WDL wasn't configured properly. The error you posted is a WDL error, not a GATK error.
I found the job manager link and it looks like the job did not start correctly.
A WDL error should have been caught by the compiler that validates the stored WDL scripts for proper formatting. This error was not reported on previous runs of the code. I'm not sure what the error could be if the only edit I made was the one you suggested re: snappy.
Eduardo Maury I think you edited an older version of the WDL. For example, there is no Xmx option in your MarkDuplicates command, which I remember we edited.
That is correct. I just re-ran with the correct version and disabling snappy. There is still an error.
Here is the run information:
workspace-id: b266996d-0c10-45f7-bd1c-39cb4eef6aa5
submission-id: 7405aabe-a0b2-40e2-9b59-e263a54615d4
Thank you for the update Eduardo Maury! Could you also try with the jdk inflater and deflater?
Still getting errors using the inflater/deflater.
workspace-id: b266996d-0c10-45f7-bd1c-39cb4eef6aa5
submission-id: 23aed5a9-2cb0-403e-b1d9-db351e2c02f3
Thanks Eduardo Maury. The developers I'm working with on this issue are out of office this week. They will be back next week, so I'm hoping to have something else for you to try then.
I am able to help with that second sample that completed the MarkDuplicates step. It looks like SortSam ran out of disk space (Caused by: java.io.IOException: No space left on device). I'm wondering if that job has enough memory for your -Xmx 590G setting. Jason Cerrato, could you take a look at how the SortSam step is set up?
Genevieve Brandt (she/her) I'm seeing the WDL for the script in the Workflow Dashboard here. This is the relevant part of the WDL:
Combined with the input for the task
command_mem_gb_sort = 550
command_mem_gb_fix = 60
command_mem_gb = 590
making -Xms 550G and -Xmx 590G for the SortSam command and
-Xms 60G and -Xmx 590G for the SetNmMdAndUqTags command.
The task has 600 GB disk space as well.
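In shell terms, the flag wiring described above amounts to something like this sketch (the variable names mirror the task inputs quoted above; the derivation itself is an assumption about how the WDL composes them):

```shell
# Task inputs, as quoted from the WDL above
command_mem_gb_sort=550
command_mem_gb_fix=60
command_mem_gb=590

# JVM options derived for each command in the task
SORT_JAVA_OPTS="-Xms${command_mem_gb_sort}G -Xmx${command_mem_gb}G"
FIX_JAVA_OPTS="-Xms${command_mem_gb_fix}G -Xmx${command_mem_gb}G"

echo "$SORT_JAVA_OPTS"   # -Xms550G -Xmx590G for SortSam
echo "$FIX_JAVA_OPTS"    # -Xms60G -Xmx590G for SetNmMdAndUqTags
```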
Does it matter if the SORT_ORDER for SortSam is "coordinate"?
Is the question above addressed to me?
Eduardo Maury thanks for checking in, we don't need anything from you right now. I'm looking into it on my end with the info from Jason. Just pinged the developers again regarding the SortAndFixTags step.
Hi Eduardo Maury,
Since the second sample is in the SortAndFixTags step, I have some recommendations for how to proceed with that step.
Could you try doubling the disk space for the SortAndFixTags step? Also set the --TMP_DIR argument for the SortSam GATK command to a directory within your working directory. It looks like the SortSam command is running out of disk space.
Let us know how that goes. And I'll also get back to you next week regarding the sample that is failing the MarkDuplicates step.
Hi Eduardo Maury,
I have an update regarding your sample #1, which is failing the MarkDuplicates step. Our Picard expert took a look and thinks you should retry on a normal-sized machine, while setting the Java -Xmx option and decreasing the sorting collection size. They think something is overflowing and causing these major problems. Here is what they recommend you try:
- Input: queryname sorted bam
- Remove -Dsamjdk.snappy.disable=true
- --SORTING_COLLECTION_SIZE_RATIO 0.125
- --ASSUME_SORT_ORDER queryname
- Remove --USE_JDK_INFLATER true --USE_JDK_DEFLATER true
- Set the machine to 16GB memory and 16GB disk space
Please let me know if you have any questions regarding this! And keep me posted if it fails or succeeds.
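The bullet points above, assembled into one sketch invocation (file names are placeholders, the command is printed rather than executed, and the exact -Xmx value of 14G is an assumption chosen to leave headroom on a 16 GB machine):

```shell
# Queryname-sorted input, reduced sorting collection ratio,
# no Snappy or JDK-codec overrides, modest heap.
MD_CMD="gatk --java-options \"-Xmx14G\" MarkDuplicates \
  --INPUT queryname_sorted.bam \
  --OUTPUT duplicates_marked.bam \
  --METRICS_FILE duplicate_metrics.txt \
  --ASSUME_SORT_ORDER queryname \
  --SORTING_COLLECTION_SIZE_RATIO 0.125"

echo "$MD_CMD"
```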
Currently trying to run with these parameters. I have mentioned this in the past, but in the short term, is there a pipeline that could just get me the bwa-aligned BAMs? I really just need to re-align to hg19. Ideally the samples would be cleaned and duplicate-marked, but optimizing this current pipeline has taken over 2 months...
Yes, the pipeline you have already run produces bwa-aligned BAMs. You could take the output from the MergeBamAlignment step if you want those.
With the new specifications, one of the samples ran to completion. However, there is still an error with the other sample, although it is now able to pass the MarkDuplicates step.
workspace-id: b266996d-0c10-45f7-bd1c-39cb4eef6aa5
submission-id: 6b54d9f9-a826-4632-b51b-4a1052a710bd
Thank you Eduardo Maury, this is great news! I am so glad that both of the samples are now past the MarkDuplicates step. For the sample that failed SortSam, could you try with the recommendations in this comment?
Not sure which specific recommendation you are referring to, since we tried many from the linked comment. Which one do you mean?
Eduardo Maury, never mind those recommendations. I was able to have more developers take a look today, and they think the boot disk error message was a red herring.
For the sample that is failing during SortSam, could you specify these parameters for the SortAndFixTags step:
- Decrease the memory for the task to 16 GB and the memory for each GATK command to 14 GB (--java-options "-Xmx14G"). Right now the task has 600 GB and each GATK command has 590 GB.
- Increase the disk space for the task to 800 GB
This should work, because we think the disk space needs to be more than 2x your input BAM size for SortSam. Decreasing the memory should greatly decrease the cost too.
Let me know if that works!
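The sizing rule above is simple back-of-envelope arithmetic; a sketch with a placeholder input size (not the actual BAM):

```shell
# Disk heuristic: SortSam needs more than 2x the input BAM size on disk
input_bam_gb=350                      # placeholder input size in GB
min_disk_gb=$(( input_bam_gb * 2 ))   # bare minimum under the 2x rule
echo "input ${input_bam_gb} GB -> request more than ${min_disk_gb} GB of disk"
```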
Still getting an error. I don't think we are able to allocate 800 GB for a task on Terra.
workspace-id: b266996d-0c10-45f7-bd1c-39cb4eef6aa5
submission-id: f4a815ce-e540-470a-839e-431848993ef4
Thanks Eduardo, I'm looking into this with the Terra team.