Mutect2-PON Storage object issue when running across multiple samples?
GATK version used : 4.1.6.0
Hello,
I am getting the following error when running the Mutect2_pon workflow across multiple samples (the same workflow ran successfully over 2 samples):
Failed to evaluate 'Mutect2.tumor_reads_size' (reason 1 of 1): Evaluating ceil(size(tumor_reads, "GB") + size(tumor_reads_index, "GB")) failed: [Attempted 1 time(s)] - StorageException: pet-116655800279130056518@htapp-project.iam.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object.
The workspace is 661-Clonal hematopoiesis, and you should already have access to it.
The submission ID is abdfaa7a-e8f6-4ac0-b08b-fed43384464e.
The workflow is: mutect2_pon
Many thanks,
Mia
-
Hi MPetlj,
Happy to help here. This error is usually due to your Proxy Group not being added to a Google bucket with the appropriate Storage Object Viewer permissions. Are you looking to access any new files in your workflow that you weren't trying to access before? Either from a Terra bucket associated with a workspace you are not added to, or an external Google bucket that your Proxy Group may not have been added to (even if your normal email account was)?
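If a bucket does turn out to be missing that permission, granting it to the proxy group is usually a one-liner run by whoever owns the bucket. This is only a sketch with placeholder values - PROXY_GROUP_EMAIL is your proxy group address (visible on your Terra profile page) and BUCKET is the external bucket in question:

gsutil iam ch group:PROXY_GROUP_EMAIL:objectViewer gs://BUCKET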
Kind regards,
Jason
-
Thanks Jason!
Let me please follow up: it appears that I had temporarily lost access to some files, and that access has now been restored.
This was most likely the issue, in line with what you suggest.
Thanks
Mia
-
Hi Mia,
Sounds good! Let us know.
Kind regards,
Jason
-
Hi Jason,
I managed to run this successfully, so it appears the issue was indeed access.
However, when I went to run it over a larger sample set, it failed again, with a different error.
Submission ID: ff8f086e-7055-4842-8abd-f8e1e15d3a46
From the worklog:
java.lang.Exception: The WorkflowDockerLookupActor has failed. Subsequent docker tags for this workflow will not be resolved.
2020-06-25 18:15:59,947 WARN - BackendPreparationActor_for_7a338ac5:Mutect2.M2:20:2 [UUID(7a338ac5)]: Docker lookup failed
Not sure why docker is failing, when I ran the workflow the day before successfully?
Many thanks
Mia
-
Hi Mia,
Hmm, that's very strange. The job details are failing to load, and we're seeing the workflow metadata returning an error because it has millions of rows, which is very strange for a mutect2_pon workflow. Can you share this workflow with jcerrato@broadinstitute.org so I can take a closer look at any differences there may be between our version of the workflow and yours?
Kind regards,
Jason
-
Thanks Jason, have done so - please let me know if you can see it.
Best wishes,
Mia
-
Jason,
The only change that was made to the workflow was that the M2 task now accepts a billing_project input, and the argument ${"--gcs-project-for-requester-pays " + billing_project} was added to all gatk commands within the M2 task. Thanks.
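For reference, the shape of the change is roughly the following (a trimmed-down sketch, not the full task - everything apart from billing_project and the requester-pays flag is illustrative):

task M2 {
  File ref_fasta
  File tumor_reads
  File tumor_reads_index
  String? billing_project        # new optional input

  command {
    gatk Mutect2 \
      -R ${ref_fasta} \
      -I ${tumor_reads} \
      ${"--gcs-project-for-requester-pays " + billing_project} \
      -O output.vcf
  }

  output {
    File unfiltered_vcf = "output.vcf"
  }
}

When billing_project is left undefined, the ${"--gcs-project-for-requester-pays " + billing_project} placeholder expands to nothing, so the change is backwards compatible.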
-
Hi both,
Thanks for the info. I've been informed by one of our Cromwell engineers that there was a Quay.io outage that likely caused this issue. It's a transient issue, so running again is the best recommendation.
Kind regards,
Jason
-
Thanks Jason - when did the outage happen? I did rerun this last night and it failed again (so it has now failed twice over two days).
-
Hi Mia,
Are you seeing this in the log of any particular task? Can you share the log where you see the message so I can take a closer look and pass it on to one of our engineers if needed?
I see the originally flagged ff8f086e-7055-4842-8abd-f8e1e15d3a46 was run yesterday on June 25, and 0eb925d6-fcd1-493d-bc84-78e3d2ce4dd2 was also run yesterday. You mentioned it failed the second time over two days. Is there a run from before yesterday that ran into the same issue?
Kind regards,
Jason
-
Hi Mia,
After discussing with one of our engineers, it isn't the case that the Quay outage was the source of the issue; your workflow is pulling from GCR and Dockerhub and the timing is different from when the outage occurred. The engineer was simply reporting a similar instance that caused comparable behavior, but the error message does happen for repos other than Quay and in circumstances other than a repo outage.
If you know which part of the workflow was failing, please let us know. One of our engineers also noticed an exception in the logs that could be related, and they're investigating. I'll keep you updated.
Kind regards,
Jason
-
Hi Jason,
for ff8f086e-7055-4842-8abd-f8e1e15d3a46
The error that I pasted above:
java.lang.Exception: The WorkflowDockerLookupActor has failed. Subsequent docker tags for this workflow will not be resolved.
2020-06-25 18:15:59,947 WARN - BackendPreparationActor_for_7a338ac5:Mutect2.M2:20:2 [UUID(7a338ac5)]: Docker lookup failed
This was from the worklog. I found it by searching for 'fail' and 'error', as I did not know how else to approach this, since I am not getting any errors in Job History other than 'failed'. If I try to go to 'view' in Job History, I get the same error window I reported before - from both Terra and FireCloud (attached).
However, the error above may not be the root of the problem.
I think that with the job ID and the access I gave you before, you should be able to access all the worklogs from this job and have a look at what you think may be happening?
Mia
-
Hi MPetlj,
Our engineers are investigating the root cause of your original run-ins with the Docker lookup failed messages, but they believe these to be transient and that you should be able to get a successful run if you try again. Can you give it another go with call caching enabled and let us know the result?
Kind regards,
Jason
-
Thanks Jason, have done so and I will let you know.
-
Hi Jason,
This failed again; the submission ID is below, which you can use to track down the worklog:
340c3e3c-b4ad-4388-aac8-7d967d1b3c86
Can you please let me know what the next steps are?
Many thanks
Mia
-
Hi Mia,
I have informed the batch team who will take a deeper look at this. Did it fail for the same java.lang.Exception: The WorkflowDockerLookupActor has failed error or did you see it fail for another reason?
Kind regards,
Jason
-
Hi Jason,
It would be best if you could please forward the log to the team or have a look at it yourself (you should be able to access it?). As noted before, I was never 100% sure this was the root of the problem.
Somebody with more experience than me would ideally have a thorough look through the log at this stage, to speed up the process in case there are other issues to be addressed.
Thanks
-
Hi Mia,
We are looking on our end. I'll let you know what we find.
Kind regards,
Jason
-
Hi Mia,
Investigating this job, we came across this error:
2020-06-30 00:10:19 [cromwell-system-akka.actor.default-dispatcher-1436] INFO c.e.workflow.WorkflowManagerActor - WorkflowManagerActor Workflow 221cdc26-a7d9-4e4a-875d-19b6ec82d9ab failed (during ExecutingWorkflowState): java.lang.Exception: Task Mutect2.M2:11:1 failed. Job exit code 1. Check gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/340c3e3c-b4ad-4388-aac8-7d967d1b3c86/Mutect2_Panel/221cdc26-a7d9-4e4a-875d-19b6ec82d9ab/call-Mutect2/shard-77/m2.Mutect2/66b04541-89d1-4201-89da-d5fde415f3e9/call-M2/shard-11/stderr for more information. PAPI error code 9. Please check the log file for more details: gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/340c3e3c-b4ad-4388-aac8-7d967d1b3c86/Mutect2_Panel/221cdc26-a7d9-4e4a-875d-19b6ec82d9ab/call-Mutect2/shard-77/m2.Mutect2/66b04541-89d1-4201-89da-d5fde415f3e9/call-M2/shard-11/M2-11.log.
This is showing to be a PAPI error code 9, which likely points to either a resource deficiency or a failure to generate a needed file in the task.
There was also this part of the log, which I'm less certain about, but perhaps is helpful to you.
java.lang.IllegalStateException: Smith-Waterman alignment failure. Cigar = 432M with reference length 432 but expecting reference length of 489 ref = GAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCTCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCCTCATGGTCCCCGGGATCTGACGTCACTCTCCTCGCTGAAGCCCTGGTGACTGTCACAAACATCGAGGTTATTAATTGCAGCATCACAGAAATAGAAACAACGACTTCCAGCATCCCTGGGGCCTCAGACACAGATCTCATCCCCACGGAAGGGGTGAAGGC path GAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCTCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCTCCATCCAGTCATCACCCCGTCACGGGCCTCAGAGAGCAGCGCCTCTTCCGACGGCCCCCATCCAGTCATCACCCCCTCATGGTCCCCGGGATCTGACGTCACTCTCCTCGCTGAAGCCCTGGTGACTGTCACAAACATCGAGGTTATTAATTGCAGCATCACAGAAATAGAAACAACGACTTCCAGCATCCCTGGGGCCTCAGACACAGATCTCATCCCCACGGAAGGGGTGAAGGC
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.findBestPaths(ReadThreadingAssembler.java:354)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.assembleKmerGraphsAndHaplotypeCall(ReadThreadingAssembler.java:196)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.runLocalAssembly(ReadThreadingAssembler.java:146)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerUtils.assembleReads(AssemblyBasedCallerUtils.java:269)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:226)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:299)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:200)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
And the end of the log file has this error:
2020/06/29 19:56:02 Delocalizing output /cromwell_root/output.vcf.stats -> gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/340c3e3c-b4ad-4388-aac8-7d967d1b3c86/Mutect2_Panel/221cdc26-a7d9-4e4a-875d-19b6ec82d9ab/call-Mutect2/shard-77/m2.Mutect2/66b04541-89d1-4201-89da-d5fde415f3e9/call-M2/shard-11/output.vcf.stats
Required file output '/cromwell_root/output.vcf.stats' does not exist.
So it looks like this output.vcf.stats file isn't being generated, which is likely what the PAPI error code 9 is referencing. You can get the full log by using gsutil to cp gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/340c3e3c-b4ad-4388-aac8-7d967d1b3c86/Mutect2_Panel/221cdc26-a7d9-4e4a-875d-19b6ec82d9ab/call-Mutect2/shard-77/m2.Mutect2/66b04541-89d1-4201-89da-d5fde415f3e9/call-M2/shard-11/M2-11.log.
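For example, from a terminal where gsutil is set up with your Terra account (the trailing dot just copies the file into your current directory):

gsutil cp gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/340c3e3c-b4ad-4388-aac8-7d967d1b3c86/Mutect2_Panel/221cdc26-a7d9-4e4a-875d-19b6ec82d9ab/call-Mutect2/shard-77/m2.Mutect2/66b04541-89d1-4201-89da-d5fde415f3e9/call-M2/shard-11/M2-11.log .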
Let me know if this provides any insight for you. Beri and I are also available to have a half-hour chat with you sometime today if that would be helpful. Let me know if so and I can schedule something on our calendars.
Kind regards,
Jason
-
Hi Jason,
Thank you for digging those out.
I did read through and it seems, as you also suggest, that a critical file is not being generated.
Can someone from your team who knows the ins and outs of the workflow please look into it?
If Beri could provide advice for this particular issue, I would be happy to get on a call today before 1.30 or tomorrow.
Please let me know,
Mia
-
I believe the error (java.lang.IllegalStateException: Smith-Waterman alignment failure. Cigar = 432M with reference length 432 but expecting reference length of 489) is due to a bug in GATK which is fixed in version 4.1.7. You can use GATK 4.1.7.0 by setting the input gatk_docker to "us.gcr.io/broad-gatk/gatk:4.1.7.0".
-
Hi Mia,
It is just as Josh says. A Google search for your issue turns up a GATK thread where a similar issue is reported: https://gatk.broadinstitute.org/hc/en-us/community/posts/360060174372-Haplotype-Caller-4-1-6-0-java-lang-IllegalStateException-Smith-Waterman-alignment-failure-
The issue there is confirmed to be a bug in 4.1.6.0 that has been resolved in 4.1.7.0.
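In Terra that just means overriding the docker image input on the workflow's inputs page, along these lines - the left-hand attribute path here is only illustrative, so use whatever path the inputs table shows for gatk_docker in your copy of the workflow:

"Mutect2_Panel.Mutect2.gatk_docker": "us.gcr.io/broad-gatk/gatk:4.1.7.0"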
Kind regards,
Jason
-
Thank you both, I will try and follow up.
-
Hello,
This failed again, possibly due to reasons beyond the one that Joshua spotted.
Submission ID: a7546857-07a3-4494-b786-be408d1c5301
I get the same error page if I try to click on 'view' under the Job History as I reported before for this workflow.
Worklog is here:
I do not see anything other than this:
2020-07-02 08:41:00,894 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:6:1]: Status change from Running to Failed
2020-07-02 08:41:04,372 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:9:1]: Status change from Running to Failed
2020-07-02 08:41:10,615 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:12:1]: Status change from Running to Failed
2020-07-02 08:41:14,432 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:14:1]: Status change from Running to Failed
2020-07-02 08:41:15,156 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:7:1]: Status change from Running to Failed
2020-07-02 08:41:20,700 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:10:1]: Status change from Running to Failed
2020-07-02 08:41:24,871 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:11:1]: Status change from Running to Failed
2020-07-02 08:41:25,151 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:8:1]: Status change from Running to Failed
2020-07-02 08:41:33,470 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:3:1]: Status change from Running to Failed
2020-07-02 08:41:53,602 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:0:1]: Status change from Running to Failed
2020-07-02 08:41:57,068 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:13:1]: Status change from Running to Failed
2020-07-02 08:42:00,511 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:1:1]: Status change from Running to Failed
2020-07-02 08:42:03,299 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:16:1]: Status change from Running to Failed
2020-07-02 08:42:08,140 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:15:1]: Status change from Running to Failed
2020-07-02 08:42:08,144 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:2:1]: Status change from Running to Failed
2020-07-02 08:42:08,146 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:4:1]: Status change from Running to Failed
2020-07-02 08:42:31,767 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(bd43a719)Mutect2_Panel.CreatePanel:5:1]: Status change from Running to Failed
The major issue here is that the workflow is not producing informative errors under Job History - perhaps this can be raised with the authors, while we also figure out what is wrong here. Please let me know.
Thanks
Mia
-
Hi Mia,
Thanks for letting me know. Job History/Job Manager is not showing the job details due to the size of the metadata output by the job. Since the metadata is over two million lines, Cromwell is stopping it from fully writing out to avoid failure. The Cromwell team is working on improvements to better handle jobs with such large amounts of metadata.
You may be able to find more helpful error messaging if you click into the bucket for the workflow rather than looking at the top-level workflow.logs for the submission.
I noticed in call-CreatePanel/shard-0/CreatePanel-0.log there was an error message:
A USER ERROR has occurred: Duplicate sample: GTEX-144GM-0226. Sample was found in both gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/a7546857-07a3-4494-b786-be408d1c5301/Mutect2_Panel/bd43a719-fac5-40a3-b9af-136d914758cd/call-Mutect2/shard-22/m2.Mutect2/4f0f30a1-996e-407c-bfa6-a3eb293d893e/call-Filter/GTEX-144GM-0226-SM-E9IK3-filtered.vcf and gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/a7546857-07a3-4494-b786-be408d1c5301/Mutect2_Panel/bd43a719-fac5-40a3-b9af-136d914758cd/call-Mutect2/shard-21/m2.Mutect2/4a2a07e7-0485-4462-861e-b295ef5b03da/call-Filter/GTEX-144GM-0226-SM-DK2JT-filtered.vcf.
This looks to be the error showing in all the CreatePanel shards. Can you confirm whether this points to a configuration issue in the way the workflow is set up? Doing a more thorough search of this workflow on the backend, only the CreatePanel task is showing as having failed.
Kind regards,
Jason
-
Hi Jason,
I deleted one of the 'duplicate' samples (although the sample should not be a duplicate as far as I am aware), but the pipeline still fails. I get no errors flagged by the Job Manager because of the large amount of metadata, as you explained before.
I searched through the various error files, but I cannot detect a problem - can your team please have a look? I would appreciate an update by the end of tomorrow so that we can try out next steps, as the project is now significantly delayed. The submission ID is d5b5ffc5-9a46-42b4-944e-ebb66eceae55, in the same workspace as always.
Many thanks
Mia
-
Hi Mia,
I was looking through the log message for that particular workflow and noticed all shards completed except shard 94. I looked at the working directory for this shard and it looks like it's running Mutect2, and it's divided into further shards. I looked at the log files for a randomly chosen shard and it appears the Mutect tool had stopped abruptly (shard 0 log). As mentioned previously, if a tool fails in this way it's probably related to not having enough memory or disk space. I don't have access to view the mutect2_pon WDL version you are using, but it looks like you can increase the M2 task memory with the M2.mem variable (the last variable below).
I think it's using 4 GB of memory right now, so maybe try 8 GB.
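To illustrate the wiring I mean (not your exact file, since I can't see it - the default and the command here are placeholders), the mem input typically feeds the task's runtime memory like this:

task M2 {
  Int mem = 4   # hypothetical default; setting M2.mem on the Terra inputs page overrides it (e.g. 8)

  command {
    echo "the real gatk Mutect2 command goes here"
  }

  runtime {
    memory: mem + " GB"
  }
}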
-
Thanks Beri, I have tried this and will let you know how it goes.
Many thanks,
Mia
-
Hi Beri,
I increased the memory to "30", but the job still fails, and it again seems that shard 94 had the same issue as you described above.
1) What else could I try?
2) Which log file did you refer to when you said you looked over the log for this workflow?
Thanks
Mia
-
1) Looks like the failing task M2 has now completed after the memory increase. Now there is a different task that is failing: CreatePanel. I'm getting this from the workflow log file (see 2). Looking at one of the task logs for CreatePanel, it looks like the error is resource-related again.
Runtime.totalMemory()=1152385024
java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at htsjdk.tribble.index.linear.LinearIndex$ChrIndex.read(LinearIndex.java:295)
at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:404)
at htsjdk.tribble.index.linear.LinearIndex.<init>(LinearIndex.java:116)
at htsjdk.tribble.index.IndexFactory$IndexType$$Lambda$47/902064508.apply(Unknown Source)
at htsjdk.tribble.index.IndexFactory$IndexType.createIndex(IndexFactory.java:119)
at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:207)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:198)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:183)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:163)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:133)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:121)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromPath(GenomicsDBImport.java:831)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReadersSerially(GenomicsDBImport.java:815)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.createSampleToReaderMap(GenomicsDBImport.java:658)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$$Lambda$93/1886478937.apply(Unknown Source)
at org.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:670)
at org.genomicsdb.importer.GenomicsDBImporter$$Lambda$97/561133045.get(Unknown Source)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
Again, I can't say for certain since I don't have access to your version of the WDL, but it looks like the original WDL is using 8 GB of memory for the task, so you can try increasing the memory to 16 (there's a rough sketch of what I mean at the end of this reply).
2. The workflow log file can be found by clicking on the submission ID, which will take you to a Google bucket containing the workflow.logs folder. That folder will contain the workflow log.
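On point 1, one caveat worth keeping in mind when raising the memory: "GC overhead limit exceeded" is a Java heap error, so the increase only helps if the -Xmx handed to gatk inside the task grows along with the task memory. The public WDLs usually derive it from the memory input, roughly like this - the names, defaults, and exact GenomicsDBImport arguments here are illustrative, not necessarily what your copy does:

task CreatePanel {
  Array[File] input_vcfs
  File intervals
  Int mem = 8                            # task memory in GB; try 16
  Int command_mem_mb = mem * 1000 - 500  # JVM heap in MB, leaving headroom for the OS

  command {
    gatk --java-options "-Xmx${command_mem_mb}m" GenomicsDBImport \
      -V ${sep=" -V " input_vcfs} \
      -L ${intervals} \
      --genomicsdb-workspace-path pon_db
  }

  runtime {
    memory: mem + " GB"
  }
}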