MarkDuplicates
Hi,
I am trying to mark duplicates using the following command:
java -Xmx20g -jar /software/eucleia/Picard_v2.27.1/picard.jar MarkDuplicates INPUT=Sample_10-LOP-083.bam VALIDATION_STRINGENCY=LENIENT OUTPUT=test.dedup.bam METRICS_FILE=test
which works fine for small BAM files, but not for files >15 GB. For large BAM files, no output files are created at all!
Can you please help me with this?
thanks,
Atal
-
Atal Saha. It would be helpful if you could post the log from the failing run. Were there any warnings or errors reported? Your command looks OK to me, so without more information it's hard to say what's wrong.
Also, could you describe your computer, i.e. what OS and hardware you're running?
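In the meantime, if you're not already capturing a log, redirecting stdout/stderr to a file on the next run would give us something to look at (plain shell redirection; the log file name here is just an example):
java -Xmx20g -jar /software/eucleia/Picard_v2.27.1/picard.jar MarkDuplicates INPUT=Sample_10-LOP-083.bam VALIDATION_STRINGENCY=LENIENT OUTPUT=test.dedup.bam METRICS_FILE=test > markdup.log 2>&1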
-
Hi Louis,
Thanks for your reply on this.
The problem was initially solved by increasing -Xmx20g to -Xmx160g. However, my files are around 20 GB each (60 files in total), so this step is taking forever to finish. So I chose to use MarkDuplicatesSpark instead (hoping it would speed things up), but I'm having a similar problem!
My command now is:
/software/eucleia/gatk-4.2.6.1/gatk --java-options "-Xmx260G" MarkDuplicatesSpark -I Sample_13-LOP-089.bam -O Sample_13-LOP-089.dedup.bam -M Sample_13-LOP-089.bam.metrics --tmp-dir tempor/
Using GATK jar /gpfs/gpfs0/software/rhel7/eucleia/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx260G -jar /gpfs/gpfs0/software/rhel7/eucleia/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar MarkDuplicatesSpark -I Sample_13-LOP-089.bam -O Sample_13-LOP-089.dedup.bam -M Sample_13-LOP-089.bam.metrics --tmp-dir tempor/
18:12:56.443 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs0/software/rhel7/eucleia/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:12:56.607 INFO MarkDuplicatesSpark - ------------------------------------------------------------
18:12:56.607 INFO MarkDuplicatesSpark - The Genome Analysis Toolkit (GATK) v4.2.6.1
18:12:56.607 INFO MarkDuplicatesSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
18:12:56.608 INFO MarkDuplicatesSpark - Executing as a20270@eucleia.hi.no on Linux v3.10.0-1160.66.1.el7.x86_64 amd64
18:12:56.608 INFO MarkDuplicatesSpark - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_352-b08
18:12:56.608 INFO MarkDuplicatesSpark - Start Date/Time: January 10, 2023 6:12:56 PM CET
18:12:56.608 INFO MarkDuplicatesSpark - ------------------------------------------------------------
18:12:56.608 INFO MarkDuplicatesSpark - ------------------------------------------------------------
18:12:56.609 INFO MarkDuplicatesSpark - HTSJDK Version: 2.24.1
18:12:56.609 INFO MarkDuplicatesSpark - Picard Version: 2.27.1
18:12:56.609 INFO MarkDuplicatesSpark - Built for Spark Version: 2.4.5
18:12:56.609 INFO MarkDuplicatesSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:12:56.609 INFO MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:12:56.609 INFO MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:12:56.609 INFO MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:12:56.609 INFO MarkDuplicatesSpark - Deflater: IntelDeflater
18:12:56.609 INFO MarkDuplicatesSpark - Inflater: IntelInflater
18:12:56.609 INFO MarkDuplicatesSpark - GCS max retries/reopens: 20
18:12:56.610 INFO MarkDuplicatesSpark - Requester pays: disabled
18:12:56.610 INFO MarkDuplicatesSpark - Initializing engine
18:12:56.610 INFO MarkDuplicatesSpark - Done initializing engine
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/01/10 18:12:56 INFO SparkContext: Running Spark version 2.4.5
23/01/10 18:12:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/10 18:12:57 INFO SparkContext: Submitted application: MarkDuplicatesSpark
23/01/10 18:12:57 INFO SecurityManager: Changing view acls to: a20270
23/01/10 18:12:57 INFO SecurityManager: Changing modify acls to: a20270
23/01/10 18:12:57 INFO SecurityManager: Changing view acls groups to:
23/01/10 18:12:57 INFO SecurityManager: Changing modify acls groups to:
23/01/10 18:12:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(a20270); groups with view permissions: Set(); users with modify permissions: Set(a20270); groups with modify permissions: Set()
23/01/10 18:12:57 INFO Utils: Successfully started service 'sparkDriver' on port 34676.
23/01/10 18:12:57 INFO SparkEnv: Registering MapOutputTracker
23/01/10 18:12:57 INFO SparkEnv: Registering BlockManagerMaster
23/01/10 18:12:57 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/01/10 18:12:57 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/01/10 18:12:57 INFO DiskBlockManager: Created local directory at /gpfs/gpfs0/scratch/Ecogenome/reSultS/bam/tempor/blockmgr-b53c74cd-cce0-4b06-8e38-dd30b945a257
23/01/10 18:12:57 INFO MemoryStore: MemoryStore started with capacity 138.5 GB
23/01/10 18:12:57 INFO SparkEnv: Registering OutputCommitCoordinator
23/01/10 18:12:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/01/10 18:12:57 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://eucleia.hi.no:4040
23/01/10 18:12:57 INFO Executor: Starting executor ID driver on host localhost
23/01/10 18:12:57 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35003.
23/01/10 18:12:57 INFO NettyBlockTransferService: Server created on eucleia.hi.no:35003
23/01/10 18:12:58 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/01/10 18:12:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, eucleia.hi.no, 35003, None)
23/01/10 18:12:58 INFO BlockManagerMasterEndpoint: Registering block manager eucleia.hi.no:35003 with 138.5 GB RAM, BlockManagerId(driver, eucleia.hi.no, 35003, None)
23/01/10 18:12:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, eucleia.hi.no, 35003, None)
23/01/10 18:12:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, eucleia.hi.no, 35003, None)
18:12:58.216 INFO MarkDuplicatesSpark - Spark verbosity set to INFO (see --spark-verbosity argument)
23/01/10 18:12:58 INFO GoogleHadoopFileSystemBase: GHFS version: 1.9.4-hadoop3
23/01/10 18:21:28 WARN NettyRpcEnv: Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from eucleia.hi.no:34676 in 10000 milliseconds
23/01/10 18:21:28 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:846)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:875)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:875)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:875)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:875)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
23/01/10 18:35:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 390.4 KB, free 138.5 GB)
23/01/10 18:35:27 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 35.4 KB, free 138.5 GB)
23/01/10 18:35:27 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eucleia.hi.no:35003 (size: 35.4 KB, free: 138.5 GB)
23/01/10 18:35:27 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at PathSplitSource.java:96
23/01/10 18:37:41 INFO BlockManagerInfo: Removed broadcast_0_piece0 on eucleia.hi.no:35003 in memory (size: 35.4 KB, free: 138.5 GB)
23/01/10 18:56:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 390.4 KB, free 138.5 GB)
23/01/10 18:56:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 35.4 KB, free 138.5 GB)
23/01/10 18:56:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eucleia.hi.no:35003 (size: 35.4 KB, free: 138.5 GB)
23/01/10 18:56:48 INFO SparkContext: Created broadcast 1 from newAPIHadoopFile at PathSplitSource.java:96
23/01/10 18:58:33 INFO SparkUI: Stopped Spark web UI at http://eucleia.hi.no:4040
23/01/10 18:58:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/01/10 18:58:33 INFO MemoryStore: MemoryStore cleared
23/01/10 18:58:33 INFO BlockManager: BlockManager stopped
23/01/10 18:58:33 INFO BlockManagerMaster: BlockManagerMaster stopped
23/01/10 18:58:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/01/10 18:58:33 INFO SparkContext: Successfully stopped SparkContext
18:58:33.175 INFO MarkDuplicatesSpark - Shutting down engine
[January 10, 2023 6:58:33 PM CET] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 45.61 minutes.
Runtime.totalMemory()=68833247232
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
23/01/10 18:58:33 INFO ShutdownHookManager: Shutdown hook called
23/01/10 18:58:33 INFO ShutdownHookManager: Deleting directory /gpfs/gpfs0/scratch/Ecogenome/reSultS/bam/tempor/spark-42676cb5-485b-4bb6-b07e-b3021bf03cee
I am using a Linux server with 72 cores and 504 GB memory. Any suggestions, please?
thanks,
Atal
-
I wouldn't have expected regular MarkDuplicates to need more than 20 GB at all. That seems strange to me.
A few questions:
1. When normal MarkDuplicates fails, does it just take forever and never finish, or does it exit with an error? Or worse, is it a silent exit with no clear error...
2. What sort of data are you running? Human whole genomes?
3. How are your inputs sorted? By coordinate or queryname? (A quick way to check is shown after these questions.)
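If you're unsure, the sort order is recorded in the SO field of the @HD header line, so (assuming samtools is available) something like
samtools view -H Sample_10-LOP-083.bam | grep '^@HD'
should print either SO:coordinate or SO:queryname.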
MarkDuplicatesSpark can be substantially faster, but it can also be pretty temperamental. It's very sensitive to core/memory allocations and depends a lot on the way you're running it.
A few details about that:
1. Spark can either run in local mode as a standalone single process, or on a managed Spark cluster with something like YARN. The way you're running it is as a single process. We find that it doesn't scale that well past around 8 cores in a single process, so I would recommend restricting it to that:
--spark-runner LOCAL --spark-master 'local[8]'
It can run faster and with more parallelism as a multi-process Spark job, but setting up a Spark cluster can be an ordeal.
It's EXTREMELY sensitive to the latency of the TMP disk it's writing to, so you basically have to use a locally attached SSD or better. Network drives do not work well. I would double-check that.
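To make that concrete, here's a sketch of your command restricted to 8 local cores, with --tmp-dir pointed at node-local disk (the /local/scratch path is a placeholder for wherever a local SSD is mounted on your system, and the heap size is illustrative):
/software/eucleia/gatk-4.2.6.1/gatk --java-options "-Xmx64G" MarkDuplicatesSpark -I Sample_13-LOP-089.bam -O Sample_13-LOP-089.dedup.bam -M Sample_13-LOP-089.bam.metrics --spark-runner LOCAL --spark-master 'local[8]' --tmp-dir /local/scratch/tempor/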
You're probably best off running multiple slower MarkDuplicates(Spark) jobs in parallel (assuming you have sufficient disk bandwidth) rather than trying to accelerate a single MarkDuplicatesSpark run using all 72 cores.
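For instance, a minimal shell sketch for fanning out the non-Spark jobs (GNU xargs -P is just one option, your scheduler's array jobs would do the same; the parallelism count, heap size, and output naming are placeholders to tune so all concurrent jobs fit in memory together):
# run up to 7 MarkDuplicates jobs concurrently; tune -P and -Xmx together
ls Sample_*.bam | xargs -P 7 -I {} java -Xmx60g -jar /software/eucleia/Picard_v2.27.1/picard.jar MarkDuplicates INPUT={} OUTPUT={}.dedup.bam METRICS_FILE={}.metrics VALIDATION_STRINGENCY=LENIENT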
-
Thanks again, Louis!
Answers to your questions:
1. It was a silent exit with no clear error.
2. It's monkfish data, whole genomes.
3. Inputs are sorted by coordinate.
I have just given up on Spark, as it seems impossible to run.
MarkDuplicates is now running with 160 GB, and I am running it for 7 files in parallel. I have 60 files in total; I'll let you know how it goes.
Please let me know if you have further suggestions!
Atal
-
It's possible to get Spark running well, but it's definitely much more of a pain than I would like it to be. If your data is in Google Cloud you can probably get it running on Dataproc more easily, but I think running standard MarkDuplicates is probably your fastest solution.
I don't know how your compute is set up exactly, but you can run into problems if you set -Xmx to be exactly as large as the memory allocation on your job. You need to leave some memory left over for non-Java (native) memory; sometimes that can make it look like you need more memory than you actually do.
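As a hypothetical illustration (the numbers and the SLURM-style directive are placeholders; the principle is the same on any scheduler):
# request 180 GB from the scheduler but give the JVM only 160 GB,
# leaving headroom for native (non-heap) memory
#SBATCH --mem=180G
java -Xmx160g -jar /software/eucleia/Picard_v2.27.1/picard.jar MarkDuplicates INPUT=Sample_10-LOP-083.bam OUTPUT=Sample_10-LOP-083.dedup.bam METRICS_FILE=Sample_10-LOP-083.metrics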
I wonder if there's anything special about monkfish genomes or the sequencing that could make it behave strangely. The only thing I can think of is that if there is somehow an extremely high duplication rate, maybe there's some bug causing non-linear memory usage for a duplicate set or something. I'm not sure why that would be, though.
Good luck. Sounds like it might be a bit of a slog but should get done eventually! I'm sorry I don't have much more insight into why it's using so much memory.