GATK MarkDuplicatesSpark output error
Hello, I am using GATK version 4.2.5.0 to run MarkDuplicatesSpark on ~40 GB of data, but I am facing an error. The command runs, but it does not produce an output file. I've stopped the running process because of the issue, but here is the situation so far.
a) GATK version used: gatk-4.2.5.0
b) Exact command used:
bilgetabak@Bilges-MacBook-Pro-second programs % /Users/bilgetabak/programs/gatk-4.2.5.0/gatk MarkDuplicatesSpark -I /Users/bilgetabak/programs/WGS/CVMsorted.bam -O /Users/bilgetabak/programs/WGS/CVM_markdupspark.bam
c) Entire program log:
Using GATK jar /Users/bilgetabak/programs/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Users/bilgetabak/programs/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar MarkDuplicatesSpark -I /Users/bilgetabak/programs/WGS/CVMsorted.bam -O /Users/bilgetabak/programs/WGS/CVM_markdupspark.bam
20:11:30.034 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/bilgetabak/programs/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
Mar 11, 2022 8:11:30 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
20:11:30.628 INFO MarkDuplicatesSpark - ------------------------------------------------------------
20:11:30.628 INFO MarkDuplicatesSpark - The Genome Analysis Toolkit (GATK) v4.2.5.0
20:11:30.628 INFO MarkDuplicatesSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
20:11:30.628 INFO MarkDuplicatesSpark - Executing as bilgetabak@BilgesMBPsecond.home on Mac OS X v12.2.1 x86_64
20:11:30.628 INFO MarkDuplicatesSpark - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_321-b07
20:11:30.629 INFO MarkDuplicatesSpark - Start Date/Time: March 11, 2022 8:11:30 PM EET
20:11:30.629 INFO MarkDuplicatesSpark - ------------------------------------------------------------
20:11:30.629 INFO MarkDuplicatesSpark - ------------------------------------------------------------
20:11:30.629 INFO MarkDuplicatesSpark - HTSJDK Version: 2.24.1
20:11:30.629 INFO MarkDuplicatesSpark - Picard Version: 2.25.4
20:11:30.629 INFO MarkDuplicatesSpark - Built for Spark Version: 2.4.5
20:11:30.629 INFO MarkDuplicatesSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 2
20:11:30.629 INFO MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
20:11:30.629 INFO MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
20:11:30.629 INFO MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
20:11:30.629 INFO MarkDuplicatesSpark - Deflater: IntelDeflater
20:11:30.629 INFO MarkDuplicatesSpark - Inflater: IntelInflater
20:11:30.630 INFO MarkDuplicatesSpark - GCS max retries/reopens: 20
20:11:30.630 INFO MarkDuplicatesSpark - Requester pays: disabled
20:11:30.630 INFO MarkDuplicatesSpark - Initializing engine
20:11:30.630 INFO MarkDuplicatesSpark - Done initializing engine
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/03/11 20:11:30 INFO SparkContext: Running Spark version 2.4.5
22/03/11 20:11:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/11 20:11:31 INFO SparkContext: Submitted application: MarkDuplicatesSpark
22/03/11 20:11:31 INFO SecurityManager: Changing view acls to: bilgetabak
22/03/11 20:11:31 INFO SecurityManager: Changing modify acls to: bilgetabak
22/03/11 20:11:31 INFO SecurityManager: Changing view acls groups to:
22/03/11 20:11:31 INFO SecurityManager: Changing modify acls groups to:
22/03/11 20:11:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(bilgetabak); groups with view permissions: Set(); users with modify permissions: Set(bilgetabak); groups with modify permissions: Set()
22/03/11 20:11:31 INFO Utils: Successfully started service 'sparkDriver' on port 55175.
22/03/11 20:11:31 INFO SparkEnv: Registering MapOutputTracker
22/03/11 20:11:31 INFO SparkEnv: Registering BlockManagerMaster
22/03/11 20:11:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/03/11 20:11:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/03/11 20:11:31 INFO DiskBlockManager: Created local directory at /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/blockmgr-480fec2f-c0b5-4a41-a40b-56491052cfd1
22/03/11 20:11:31 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
22/03/11 20:11:31 INFO SparkEnv: Registering OutputCommitCoordinator
22/03/11 20:11:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/03/11 20:11:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://bilgesmbpsecond.home:4040
22/03/11 20:11:31 INFO Executor: Starting executor ID driver on host localhost
22/03/11 20:11:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55176.
22/03/11 20:11:31 INFO NettyBlockTransferService: Server created on bilgesmbpsecond.home:55176
22/03/11 20:11:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/03/11 20:11:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, bilgesmbpsecond.home, 55176, None)
22/03/11 20:11:31 INFO BlockManagerMasterEndpoint: Registering block manager bilgesmbpsecond.home:55176 with 2004.6 MB RAM, BlockManagerId(driver, bilgesmbpsecond.home, 55176, None)
22/03/11 20:11:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, bilgesmbpsecond.home, 55176, None)
22/03/11 20:11:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, bilgesmbpsecond.home, 55176, None)
20:11:32.144 INFO MarkDuplicatesSpark - Spark verbosity set to INFO (see --spark-verbosity argument)
22/03/11 20:11:32 INFO GoogleHadoopFileSystemBase: GHFS version: 1.9.4-hadoop3
22/03/11 20:11:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 307.3 KB, free 2004.3 MB)
22/03/11 20:11:33 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 35.4 KB, free 2004.3 MB)
22/03/11 20:11:33 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on bilgesmbpsecond.home:55176 (size: 35.4 KB, free: 2004.6 MB)
22/03/11 20:11:33 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at PathSplitSource.java:96
22/03/11 20:11:33 INFO BlockManagerInfo: Removed broadcast_0_piece0 on bilgesmbpsecond.home:55176 in memory (size: 35.4 KB, free: 2004.6 MB)
22/03/11 20:11:33 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 307.3 KB, free 2004.3 MB)
22/03/11 20:11:33 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 35.4 KB, free 2004.3 MB)
22/03/11 20:11:33 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on bilgesmbpsecond.home:55176 (size: 35.4 KB, free: 2004.6 MB)
22/03/11 20:11:33 INFO SparkContext: Created broadcast 1 from newAPIHadoopFile at PathSplitSource.java:96
22/03/11 20:11:33 INFO FileInputFormat: Total input files to process : 1
22/03/11 20:11:33 INFO SparkContext: Starting job: sortByKey at SparkUtils.java:165
22/03/11 20:11:33 INFO DAGScheduler: Got job 0 (sortByKey at SparkUtils.java:165) with 1156 output partitions
22/03/11 20:11:33 INFO DAGScheduler: Final stage: ResultStage 0 (sortByKey at SparkUtils.java:165)
22/03/11 20:11:33 INFO DAGScheduler: Parents of final stage: List()
22/03/11 20:11:33 INFO DAGScheduler: Missing parents: List()
22/03/11 20:11:33 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[15] at sortByKey at SparkUtils.java:165), which has no missing parents
22/03/11 20:11:33 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 455.8 KB, free 2003.8 MB)
22/03/11 20:11:33 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 171.7 KB, free 2003.7 MB)
22/03/11 20:11:33 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on bilgesmbpsecond.home:55176 (size: 171.7 KB, free: 2004.4 MB)
22/03/11 20:11:33 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1163
22/03/11 20:11:33 INFO DAGScheduler: Submitting 1156 missing tasks from ResultStage 0 (MapPartitionsRDD[15] at sortByKey at SparkUtils.java:165) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/03/11 20:11:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 1156 tasks
22/03/11 20:11:34 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:34 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
22/03/11 20:11:34 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/03/11 20:11:34 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
22/03/11 20:11:34 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
22/03/11 20:11:34 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
22/03/11 20:11:34 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/03/11 20:11:34 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
22/03/11 20:11:34 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:167772160+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:234881024+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:67108864+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:134217728+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:100663296+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:0+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:201326592+33554432
22/03/11 20:11:34 INFO NewHadoopRDD: Input split: file:/Users/bilgetabak/programs/WGS/CVMsorted.bam:33554432+33554432
22/03/11 20:11:35 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 61665 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
22/03/11 20:11:35 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 61921 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 61494 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, localhost, executor driver, partition 10, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO Executor: Running task 10.0 in stage 0.0 (TID 10)
22/03/11 20:11:35 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
22/03/11 20:11:35 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 61723 bytes result sent to driver
22/03/11 20:11:35 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 61820 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1409 ms on localhost (executor driver) (1/1156)
22/03/11 20:11:35 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 60846 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 1400 ms on localhost (executor driver) (2/1156)
22/03/11 20:11:35 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 1399 ms on localhost (executor driver) (3/1156)
22/03/11 20:11:35 INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, localhost, executor driver, partition 11, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO TaskSetManager: Starting task 12.0 in stage 0.0 (TID 12, localhost, executor driver, partition 12, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO TaskSetManager: Starting task 13.0 in stage 0.0 (TID 13, localhost, executor driver, partition 13, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 1402 ms on localhost (executor driver) (4/1156)
22/03/11 20:11:35 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1403 ms on localhost (executor driver) (5/1156)
22/03/11 20:11:35 INFO Executor: Running task 11.0 in stage 0.0 (TID 11)
22/03/11 20:11:35 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 61610 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, localhost, executor driver, partition 14, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 1410 ms on localhost (executor driver) (6/1156)
22/03/11 20:11:35 INFO Executor: Running task 12.0 in stage 0.0 (TID 12)
22/03/11 20:11:35 INFO Executor: Running task 13.0 in stage 0.0 (TID 13)
22/03/11 20:11:35 INFO Executor: Running task 14.0 in stage 0.0 (TID 14)
22/03/11 20:11:35 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 1419 ms on localhost (executor driver) (7/1156)
22/03/11 20:11:35 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 61444 bytes result sent to driver
22/03/11 20:11:35 INFO TaskSetManager: Starting task 15.0 in stage 0.0 (TID 15, localhost, executor driver, partition 15, PROCESS_LOCAL, 7951 bytes)
22/03/11 20:11:35 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 1422 ms on localhost (executor driver) (8/1156)
22/03/11 20:11:35 INFO Executor: Running task 15.0 in stage 0.0 (TID 15)
-
Hi Bilge Tabak,
I don't see any sort of error in this program log. Could you run the program to completion and then post the error?
Best,
Genevieve
-
Hello Genevieve Brandt (she/her),
As you recommended, I ran the program one more time, and I saw this at the very end of the run:
22/03/13 19:17:12 INFO Executor: Finished task 3696.0 in stage 2.0 (TID 6008). 1783 bytes result sent to driver
22/03/13 19:17:12 INFO TaskSetManager: Finished task 3696.0 in stage 2.0 (TID 6008) in 772 ms on localhost (executor driver) (3698/3698)
22/03/13 19:17:12 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
22/03/13 19:17:12 INFO DAGScheduler: ResultStage 2 (collect at SparkUtils.java:205) finished in 419.557 s
22/03/13 19:17:12 INFO DAGScheduler: Job 1 finished: collect at SparkUtils.java:205, took 778.398099 s
22/03/13 19:17:12 INFO SparkUI: Stopped Spark web UI at http://bilgesmbpsecond.home:4040
22/03/13 19:17:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/13 19:17:16 INFO MemoryStore: MemoryStore cleared
22/03/13 19:17:16 INFO BlockManager: BlockManager stopped
22/03/13 19:17:16 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/13 19:17:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/13 19:17:16 INFO SparkContext: Successfully stopped SparkContext
19:17:16.717 INFO MarkDuplicatesSpark - Shutting down engine
[March 13, 2022 7:17:16 PM EET] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 14.84 minutes.
Runtime.totalMemory()=4019191808
***********************************************************************
A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
22/03/13 19:17:16 INFO ShutdownHookManager: Shutdown hook called
22/03/13 19:17:16 INFO ShutdownHookManager: Deleting directory /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/spark-4b0c83fa-6bcf-4b12-9faf-a0ababb0b788
I guess it couldn't find the reads in the input file. I don't understand why it couldn't find them, since the run proceeded so well up to that point.
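If it helps, the log message above says how to get more detail; here is a sketch of rerunning my same command with that flag enabled (the flag comes straight from the suggestion in the log):

# Sketch: same command as before, with the stack-trace flag from the log message
/Users/bilgetabak/programs/gatk-4.2.5.0/gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
    MarkDuplicatesSpark \
    -I /Users/bilgetabak/programs/WGS/CVMsorted.bam \
    -O /Users/bilgetabak/programs/WGS/CVM_markdupspark.bam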
Normally, GATK advises using MarkDuplicatesSpark in place of MarkDuplicates followed by SortSam. We first tried MarkDuplicatesSpark on our very first file, the alignment of the reference genome and our data (~140 GB), and couldn't get any output. Then we tried sorting first; the sorted file is about ~40 GB (the file I am mentioning above). We ran MarkDuplicatesSpark on that, but it didn't work either.
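For comparison, here is a minimal sketch of that non-Spark alternative as I understand it (the file names here are placeholders, not our actual files):

# Sketch of MarkDuplicates followed by SortSam; file names are placeholders
gatk MarkDuplicates \
    -I aligned.bam \
    -O markdup.bam \
    -M markdup_metrics.txt
gatk SortSam \
    -I markdup.bam \
    -O markdup_sorted.bam \
    --SORT_ORDER coordinate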
I ran the program on the alignment file one more time before sending this comment, and again it gives the same error:
22/03/13 20:21:47 INFO Executor: Finished task 13323.0 in stage 2.0 (TID 21651). 1790 bytes result sent to driver
22/03/13 20:21:47 INFO TaskSetManager: Finished task 13323.0 in stage 2.0 (TID 21651) in 136 ms on localhost (executor driver) (13324/13325)
22/03/13 20:21:47 INFO Executor: Finished task 13324.0 in stage 2.0 (TID 21652). 1444 bytes result sent to driver
22/03/13 20:21:47 INFO TaskSetManager: Finished task 13324.0 in stage 2.0 (TID 21652) in 139 ms on localhost (executor driver) (13325/13325)
22/03/13 20:21:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
22/03/13 20:21:47 INFO DAGScheduler: ResultStage 2 (collect at SparkUtils.java:205) finished in 306.504 s
22/03/13 20:21:47 INFO DAGScheduler: Job 1 finished: collect at SparkUtils.java:205, took 1032.350825 s
22/03/13 20:21:47 INFO SparkUI: Stopped Spark web UI at http://bilgesmbpsecond.home:4040
22/03/13 20:21:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/13 20:21:53 INFO MemoryStore: MemoryStore cleared
22/03/13 20:21:53 INFO BlockManager: BlockManager stopped
22/03/13 20:21:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/13 20:21:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/13 20:21:53 INFO SparkContext: Successfully stopped SparkContext
20:21:53.752 INFO MarkDuplicatesSpark - Shutting down engine
[March 13, 2022 8:21:53 PM EET] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 22.94 minutes.
Runtime.totalMemory()=3030908928
***********************************************************************
A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
22/03/13 20:21:53 INFO ShutdownHookManager: Shutdown hook called
22/03/13 20:21:53 INFO ShutdownHookManager: Deleting directory /private/var/folders/xp/7ryg036s6gl1r209ppsh06cr0000gn/T/spark-6670b584-02a5-43fb-a2c7-76f4a6015e5e
I am stuck at this point, unfortunately. I am waiting for your reply.
-
Hi Bilge Tabak,
Thank you for posting this! I can see the issue here - you should easily be able to fix it. The error message is not that MarkDuplicatesSpark cannot find the reads; it's that it cannot find read groups:
A USER ERROR has occurred: Bad input: Sam file header missing Read Group fields. MarkDuplicatesSpark currently requires reads to be labeled with read group tags, please add read groups tags to your reads
Read groups are mandatory for using GATK. Please see these documents for more information about read groups and how to add them (a short example follows the links below):
- https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
- https://gatk.broadinstitute.org/hc/en-us/articles/360035532352-Errors-about-read-group-RG-information
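For example, here is a minimal sketch of adding read groups after the fact with AddOrReplaceReadGroups; the RG values are placeholders, so substitute identifiers that describe your own sequencing run:

# Sketch only: the RGID/RGLB/RGPL/RGPU/RGSM values below are placeholders
gatk AddOrReplaceReadGroups \
    -I CVMsorted.bam \
    -O CVMsorted.rg.bam \
    --RGID flowcell1.lane1 \
    --RGLB lib1 \
    --RGPL ILLUMINA \
    --RGPU flowcell1.lane1.sample1 \
    --RGSM sample1

Alternatively, read groups can be attached at alignment time, for example with bwa mem's -R option (again, placeholder values):

bwa mem -R '@RG\tID:flowcell1.lane1\tSM:sample1\tPL:ILLUMINA' reference.fasta reads_1.fastq reads_2.fastq > aligned.sam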
Please let me know if you have any other questions.
Best,
Genevieve