MarkDuplicatesSpark .bam.parts output error (directory already exists)
Just to preface: I downloaded GATK into my home directory on our remote cluster because the version hosted in our bioinformatics applications directory (GATK/4.2.2.0) was failing during Spark setup.
Also, a previous run exhausted tmp storage space, so I redirected temporary files to a /tmpdir folder I created in my scratch space on the cluster.
a) GATK version used:
gatk 4.2.2.0
b) Exact command used:
I have given it 126 cores with ~252 GB total memory, which is a whole node on our cluster. This is the first time I have run MarkDuplicatesSpark, so there shouldn't be anything left over from a previous run.
MDUPES=/scratch/bell/sparks35/GL_Pink_Salmon/data/seqs/aligned_reads_Ogor1.0/mark_dupes
MERGED=/scratch/bell/sparks35/GL_Pink_Salmon/data/seqs/aligned_reads_Ogor1.0/merged_bams
/home/sparks35/gatk-4.2.2.0/gatk --java-options "-Xmx250G -Djava.io.tmpdir=/scratch/bell/sparks35/tmpdir" MarkDuplicatesSpark \
-I $MERGED/LAE_056_Ogor1.0_merged.bam \
-O $MDUPES/LAE_056_Ogor1.0_Sparkdupmarked.bam \
-M $MDUPES/metrics_out/LAE_056_Ogor1.0_Sparkdupmarked_metrics.txt
c) Entire error log:
21/09/30 02:59:26 INFO Executor: Finished task 5801.0 in stage 11.0 (TID 38444). 24472 bytes result sent to driver
21/09/30 02:59:26 INFO TaskSetManager: Finished task 5801.0 in stage 11.0 (TID 38444) in 67296 ms on localhost (executor driver) (5802/5803)
21/09/30 02:59:26 INFO Executor: Finished task 5802.0 in stage 11.0 (TID 38445). 23861 bytes result sent to driver
21/09/30 02:59:26 INFO TaskSetManager: Finished task 5802.0 in stage 11.0 (TID 38445) in 65433 ms on localhost (executor driver) (5803/5803)
21/09/30 02:59:26 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool
21/09/30 02:59:26 INFO DAGScheduler: ResultStage 11 (sortByKey at SparkUtils.java:165) finished in 4047.286 s
21/09/30 02:59:26 INFO DAGScheduler: Job 3 finished: sortByKey at SparkUtils.java:165, took 4047.317590 s
21/09/30 02:59:26 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 13.3 MB, free 133.1 GB)
21/09/30 02:59:26 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 342.3 KB, free 133.1 GB)
21/09/30 02:59:26 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on bell-b006.rcac.purdue.edu:35337 (size: 342.3 KB, free: 133.2 GB)
21/09/30 02:59:26 INFO SparkContext: Created broadcast 12 from broadcast at ReadsSparkSink.java:146
21/09/30 02:59:26 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 13.3 MB, free 133.1 GB)
21/09/30 02:59:26 INFO MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 342.3 KB, free 133.1 GB)
21/09/30 02:59:26 INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on bell-b006.rcac.purdue.edu:35337 (size: 342.3 KB, free: 133.2 GB)
21/09/30 02:59:26 INFO SparkContext: Created broadcast 13 from broadcast at BamSink.java:76
21/09/30 02:59:26 INFO SparkUI: Stopped Spark web UI at http://bell-b006.rcac.purdue.edu:4040
21/09/30 02:59:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/09/30 02:59:56 INFO MemoryStore: MemoryStore cleared
21/09/30 02:59:56 INFO BlockManager: BlockManager stopped
21/09/30 02:59:56 INFO BlockManagerMaster: BlockManagerMaster stopped
21/09/30 02:59:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/09/30 02:59:56 INFO SparkContext: Successfully stopped SparkContext
02:59:56.586 INFO MarkDuplicatesSpark - Shutting down engine
[September 30, 2021 2:59:56 AM EDT] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 218.72 minutes.
Runtime.totalMemory()=192729317376
***********************************************************************
A USER ERROR has occurred: Couldn't write file /scratch/bell/sparks35/GL_Pink_Salmon/data/seqs/aligned_reads_Ogor1.0/mark_dupes/LAE_056_Ogor1.0_Sparkdupmarked.bam because writing failed with exception Output directory /scratch/bell/sparks35/GL_Pink_Salmon/data/seqs/aligned_reads_Ogor1.0/mark_dupes/LAE_056_Ogor1.0_Sparkdupmarked.bam.parts already exists
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
21/09/30 02:59:56 INFO ShutdownHookManager: Shutdown hook called
21/09/30 02:59:56 INFO ShutdownHookManager: Deleting directory /scratch/bell/sparks35/tmpdir/spark-395262ff-2c6f-4f5e-ac96-159d6f02a13e
I do have a file called LAE_056_Ogor1.0_dupmarked.bam in the /scratch/bell/sparks35/GL_Pink_Salmon/data/seqs/aligned_reads_Ogor1.0/mark_dupes directory I am outputting into, which I generated with picard-tools MarkDuplicates. That worked fine, but I am trying to parallelize this step so I can run batch jobs on our standby queue, which has a 4:00 hr wall time; the picard-tools version takes too long for that.
-
Hi Morgan Sparks,
Could you run this command again using the java option -DGATK_STACKTRACE_ON_USER_EXCEPTION=true? You could add it to the java options you are already using.
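For example, the command from the original post with that property added to the existing java options would look something like this (a sketch; paths and filenames are unchanged from the post above):

```shell
# Same MarkDuplicatesSpark invocation, with the stack-trace property
# appended to the java options already in use.
/home/sparks35/gatk-4.2.2.0/gatk \
    --java-options "-Xmx250G -Djava.io.tmpdir=/scratch/bell/sparks35/tmpdir -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
    MarkDuplicatesSpark \
    -I $MERGED/LAE_056_Ogor1.0_merged.bam \
    -O $MDUPES/LAE_056_Ogor1.0_Sparkdupmarked.bam \
    -M $MDUPES/metrics_out/LAE_056_Ogor1.0_Sparkdupmarked_metrics.txt
```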
Best,
Genevieve
-
Genevieve Brandt (she/her), that does seem to have fixed it. Should I expect to have to do this with all ___Spark-related commands in GATK? Thanks!
-
No, -DGATK_STACKTRACE_ON_USER_EXCEPTION=true is just an extra option that helps with troubleshooting user errors.
Your problem was most likely a transient issue with Spark on your machine. Glad you got it working!
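If the error recurs, one reasonable workaround (an illustrative sketch, not an official fix; the paths mirror the command in the original post) is to remove the stale `.bam.parts` directory that a failed run left behind before re-running:

```shell
# Illustrative cleanup: remove a leftover Spark ".parts" output directory
# from a failed run so the next run can create it fresh.
# Adjust MDUPES and the filename to your own output location.
MDUPES=/scratch/bell/sparks35/GL_Pink_Salmon/data/seqs/aligned_reads_Ogor1.0/mark_dupes
parts_dir="$MDUPES/LAE_056_Ogor1.0_Sparkdupmarked.bam.parts"
if [ -d "$parts_dir" ]; then
    rm -rf "$parts_dir"
fi
```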