GATK MarkDuplicatesSpark output error
Hello, I am using gatk- version for MarkDuplicatesSpark with ~40 GB of data, but I am facing an error. The command runs, but it does not produce an output file. I stopped the running process because of the issue, so this is the situation so far.
a) GATK version used: gatk-
b) Exact command used:
bilgetabak@Bilges-MacBook-Pro-second programs % /Users/bilgetabak/programs/gatk- MarkDuplicatesSpark -I /Users/bilgetabak/programs/WGS/CVMsorted.bam -O /Users/bilgetabak/programs/WGS/CVM_markdupspark.bam
Using GATK jar /Users/bilgetabak/programs/gatk-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Users/bilgetabak/programs/gatk- MarkDuplicatesSpark -I /Users/bilgetabak/programs/WGS/CVMsorted.bam -O /Users/bilgetabak/programs/WGS/CVM_markdupspark.bam
Hi Bilge Tabak,
I don't see any sort of error in this program log. Could you run the program to completion and then post the error?
Hello Genevieve Brandt (she/her),
As you recommended, I ran the program one more time, and I saw this at the very end of the output:
22/03/13 19:17:12 INFO Executor: Finished task 3696.0 in stage 2.0 (TID 6008). 1783 bytes result sent to driver
22/03/13 19:17:12 INFO TaskSetManager: Finished task 3696.0 in stage 2.0 (TID 6008) in 772 ms on localhost (executor driver) (3698/3698)
22/03/13 19:17:12 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
22/03/13 19:17:12 INFO DAGScheduler: ResultStage 2 (collect at finished in 419.557 s
22/03/13 19:17:12 INFO DAGScheduler: Job 1 finished: collect at, took 778.398099 s
22/03/13 19:17:12 INFO SparkUI: Stopped Spark web UI at http://bilgesmbpsecond.home:4040
22/03/13 19:17:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/13 19:17:16 INFO MemoryStore: MemoryStore cleared
22/03/13 19:17:16 INFO BlockManager: BlockManager stopped
22/03/13 19:17:16 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/13 19:17:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/13 19:17:16 INFO SparkContext: Successfully stopped SparkContext
19:17:16.717 INFO MarkDuplicatesSpark - Shutting down engine
[March 13, 2022 7:17:16 PM EET] done. Elapsed time: 14.84 minutes.
A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
22/03/13 19:17:16 INFO ShutdownHookManager: Shutdown hook called
22/03/13 19:17:16 INFO ShutdownHookManager: Deleting directory /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/spark-4b0c83fa-6bcf-4b12-9faf-a0ababb0b788
I guess it couldn't find the reads in the input file. I don't understand why it couldn't find them, since the run proceeded so well.
Normally, GATK advises using MarkDuplicatesSpark, or MarkDuplicates followed by SortSam. We first tried MarkDuplicatesSpark on our very first file, which is the alignment of our data to the reference genome (~140 GB), and we couldn't get any output. Then we sorted first; the sorted file is about ~40 GB (the file I am referring to above). We then ran MarkDuplicatesSpark on it, but that did not work either.
I ran the program on the alignment file one more time before sending this comment, and it gives the same error again:
22/03/13 20:21:47 INFO Executor: Finished task 13323.0 in stage 2.0 (TID 21651). 1790 bytes result sent to driver
22/03/13 20:21:47 INFO TaskSetManager: Finished task 13323.0 in stage 2.0 (TID 21651) in 136 ms on localhost (executor driver) (13324/13325)
22/03/13 20:21:47 INFO Executor: Finished task 13324.0 in stage 2.0 (TID 21652). 1444 bytes result sent to driver
22/03/13 20:21:47 INFO TaskSetManager: Finished task 13324.0 in stage 2.0 (TID 21652) in 139 ms on localhost (executor driver) (13325/13325)
22/03/13 20:21:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
22/03/13 20:21:47 INFO DAGScheduler: ResultStage 2 (collect at finished in 306.504 s
22/03/13 20:21:47 INFO DAGScheduler: Job 1 finished: collect at, took 1032.350825 s
22/03/13 20:21:47 INFO SparkUI: Stopped Spark web UI at http://bilgesmbpsecond.home:4040
22/03/13 20:21:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/13 20:21:53 INFO MemoryStore: MemoryStore cleared
22/03/13 20:21:53 INFO BlockManager: BlockManager stopped
22/03/13 20:21:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/13 20:21:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/13 20:21:53 INFO SparkContext: Successfully stopped SparkContext
20:21:53.752 INFO MarkDuplicatesSpark - Shutting down engine
[March 13, 2022 8:21:53 PM EET] done. Elapsed time: 22.94 minutes.
A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
22/03/13 20:21:53 INFO ShutdownHookManager: Shutdown hook called
22/03/13 20:21:53 INFO ShutdownHookManager: Deleting directory /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/spark-6670b584-02a5-43fb-a2c7-76f4a6015e5e
Unfortunately, I am stuck at this point. I am waiting for your reply.
Hi Bilge Tabak,
Thank you for posting this! I can see the issue here, and you should be able to fix it easily. The error message is not saying that MarkDuplicatesSpark cannot find the reads; it is saying that MarkDuplicatesSpark cannot find read groups:
A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads
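One quick way to confirm this is to look at the BAM header (for example, the text printed by `samtools view -H`) and check whether it contains any `@RG` lines. Below is a minimal Python sketch, assuming the header text has already been captured as a string. The function name and the two sample headers are made up for illustration and are not taken from the files in this thread.

```python
def read_group_ids(header_text):
    """Return the ID of every @RG line found in a SAM header, in order."""
    ids = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header records start with a two-letter tag such as @HD, @SQ, @RG.
        if fields[0] != "@RG":
            continue
        for field in fields[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


# Hypothetical headers, for illustration only.
header_without_rg = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
header_with_rg = header_without_rg + "@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA\n"

print(read_group_ids(header_without_rg))  # empty list: no read groups
print(read_group_ids(header_with_rg))
```

An empty list for your own header would match the error above: the header carries no read group records for MarkDuplicatesSpark to use.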
Read groups are mandatory for using GATK. Please see the GATK documentation on read groups for more information about them and how to add them.
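As a rough illustration of adding read groups to an existing BAM, the sketch below assembles an `AddOrReplaceReadGroups` command line in Python and prints it. Everything passed in is a placeholder assumption: the file names `input.bam` and `output_rg.bam` and the read group values (`rg1`, `lib1`, `ILLUMINA`, `unit1`, `sample1`) are hypothetical and should be replaced with values that describe your own sequencing run. The helper function name is also made up.

```python
import shlex


def add_read_groups_command(in_bam, out_bam, rgid, rglb, rgpl, rgpu, rgsm):
    """Build the argument list for gatk AddOrReplaceReadGroups."""
    return [
        "gatk", "AddOrReplaceReadGroups",
        "-I", in_bam,      # input BAM without read groups
        "-O", out_bam,     # output BAM with read groups added
        "--RGID", rgid,    # read group ID
        "--RGLB", rglb,    # library
        "--RGPL", rgpl,    # sequencing platform
        "--RGPU", rgpu,    # platform unit
        "--RGSM", rgsm,    # sample name
    ]


# Placeholder values; none of these come from the thread.
cmd = add_read_groups_command(
    "input.bam", "output_rg.bam",
    rgid="rg1", rglb="lib1", rgpl="ILLUMINA", rgpu="unit1", rgsm="sample1",
)

# Print a copy-pasteable shell command rather than running it.
print(shlex.join(cmd))
```

The printed command can be run in a terminal, and the resulting BAM can then be passed to MarkDuplicatesSpark in place of the original file.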
Please let me know if you have any other questions.
