GATK MarkDuplicatesSpark exits without an error message
If you are seeing an error, please provide (REQUIRED):
a) GATK version used: 4.2.0.0 (installed with conda install -c bioconda)
b) Exact command used:
gatk --java-options '-Xmx116G -Xms116G -Djava.io.tmpdir=`pwd`/temp' MarkDuplicatesSpark -I ./adapt_trimmed_bwa_sam/my.bam -M duplicates_marked_and_sorted/my.metrics -O duplicates_marked_and_sorted/my.bam --tmp-dir temp
I'm running this job on an instance with 44 CPU cores, 120 GB of memory, and 2 TB of local storage, with more than 1.5 TB still free. The input BAM file is ~105 GB. The log file is 221,458 lines long; I searched it for errors but couldn't find any. The last few lines look like this:
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
21/07/21 19:55:01 INFO Executor: Finished task 460.0 in stage 6.0 (TID 37732). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO TaskSetManager: Starting task 505.0 in stage 6.0 (TID 37777, localhost, executor driver, partition 505, PROCESS_LOCAL, 8884 bytes)
21/07/21 19:55:01 INFO Executor: Running task 505.0 in stage 6.0 (TID 37777)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 460.0 in stage 6.0 (TID 37732) in 12843 ms on localhost (executor driver) (462/10282)
21/07/21 19:55:01 INFO Executor: Finished task 455.0 in stage 6.0 (TID 37727). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO TaskSetManager: Starting task 506.0 in stage 6.0 (TID 37778, localhost, executor driver, partition 506, PROCESS_LOCAL, 8504 bytes)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 455.0 in stage 6.0 (TID 37727) in 13178 ms on localhost (executor driver) (463/10282)
21/07/21 19:55:01 INFO Executor: Running task 506.0 in stage 6.0 (TID 37778)
21/07/21 19:55:01 INFO Executor: Finished task 463.0 in stage 6.0 (TID 37735). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO TaskSetManager: Starting task 507.0 in stage 6.0 (TID 37779, localhost, executor driver, partition 507, PROCESS_LOCAL, 8881 bytes)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 463.0 in stage 6.0 (TID 37735) in 9741 ms on localhost (executor driver) (464/10282)
21/07/21 19:55:01 INFO Executor: Running task 507.0 in stage 6.0 (TID 37779)
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
21/07/21 19:55:01 INFO Executor: Finished task 464.0 in stage 6.0 (TID 37736). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO TaskSetManager: Starting task 508.0 in stage 6.0 (TID 37780, localhost, executor driver, partition 508, PROCESS_LOCAL, 8507 bytes)
21/07/21 19:55:01 INFO Executor: Running task 508.0 in stage 6.0 (TID 37780)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 464.0 in stage 6.0 (TID 37736) in 9671 ms on localhost (executor driver) (465/10282)
21/07/21 19:55:01 INFO Executor: Finished task 466.0 in stage 6.0 (TID 37738). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO TaskSetManager: Starting task 509.0 in stage 6.0 (TID 37781, localhost, executor driver, partition 509, PROCESS_LOCAL, 8536 bytes)
21/07/21 19:55:01 INFO Executor: Running task 509.0 in stage 6.0 (TID 37781)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 466.0 in stage 6.0 (TID 37738) in 8742 ms on localhost (executor driver) (466/10282)
21/07/21 19:55:01 INFO Executor: Finished task 461.0 in stage 6.0 (TID 37733). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO Executor: Finished task 459.0 in stage 6.0 (TID 37731). 1319 bytes result sent to driver
21/07/21 19:55:01 INFO TaskSetManager: Starting task 510.0 in stage 6.0 (TID 37782, localhost, executor driver, partition 510, PROCESS_LOCAL, 8504 bytes)
21/07/21 19:55:01 INFO Executor: Running task 510.0 in stage 6.0 (TID 37782)
21/07/21 19:55:01 INFO TaskSetManager: Starting task 511.0 in stage 6.0 (TID 37783, localhost, executor driver, partition 511, PROCESS_LOCAL, 8565 bytes)
21/07/21 19:55:01 INFO Executor: Running task 511.0 in stage 6.0 (TID 37783)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 459.0 in stage 6.0 (TID 37731) in 13108 ms on localhost (executor driver) (467/10282)
21/07/21 19:55:01 INFO TaskSetManager: Finished task 461.0 in stage 6.0 (TID 37733) in 12809 ms on localhost (executor driver) (468/10282)
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Getting 33 non-empty blocks including 33 local blocks and 0 remote blocks
21/07/21 19:55:01 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
I also increased the ulimit to 50000, but I'm still running into the same issue. I repeated the run with a different BAM of similar size, and GATK exits at the same point. Does it require more memory? I'd appreciate any pointers, and please let me know if you need more details. Thanks.
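One way to double-check a long Spark log like this is to count the lines that mention Java-level failures. A JVM that exhausts its heap normally throws java.lang.OutOfMemoryError, which usually shows up in the log. If a scan like the one below finds nothing and the log simply stops mid-stage, an external kill (for example by the operating system) becomes a plausible explanation. This is a minimal sketch: scan_java_errors is a hypothetical helper, and the patterns are a starting point rather than an exhaustive list.

```shell
#!/usr/bin/env bash
# Sketch: scan a GATK/Spark log for Java-level failures.
# scan_java_errors is a hypothetical helper; the patterns are illustrative.

# scan_java_errors LOGFILE -> prints the number of suspicious lines
# and the final line of the log (where the run stopped).
scan_java_errors() {
  local log=$1 hits
  # grep -c prints 0 and exits 1 when nothing matches; ignore that status.
  hits=$(grep -c -E 'OutOfMemoryError|Exception|ERROR' "$log" || true)
  echo "suspicious lines: $hits"
  echo "last line: $(tail -n 1 "$log")"
}

# Example (log file name is up to you):
#   scan_java_errors markdup.log
```

A count of 0 combined with a last line in the middle of a stage (as in the excerpt above) points away from a Java exception and toward the process being terminated from outside.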
-
Hi Ramesh Ramasamy,
You may be allocating too much of your total memory to the job, given that you are specifying -Xmx116G with only 120G of total memory. Generally, it is recommended to allocate no more than 80-90% of your available memory to the job, because the JVM and the operating system also need memory outside the Java heap. Please let me know if this helps solve the problem.
Kind regards,
Pamela
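To make the 80-90% guideline concrete, the sketch below derives an -Xmx value from the machine's total memory. It assumes Linux (it reads MemTotal from /proc/meminfo, which is reported in kB), and heap_gb and the 85% figure are illustrative choices rather than GATK defaults. On a machine with exactly 120 GiB, 85% works out to 102G of heap, compared with the 116G in the original command.

```shell
#!/usr/bin/env bash
# Sketch: size the JVM heap (-Xmx) as a fraction of physical RAM so that
# non-heap JVM memory and the OS still fit. Assumes Linux /proc/meminfo;
# heap_gb and the 85% figure are illustrative, not GATK defaults.

# heap_gb TOTAL_KB PERCENT -> whole gigabytes (GiB), rounded down
heap_gb() {
  local total_kb=$1 percent=$2
  echo $(( total_kb * percent / 100 / 1024 / 1024 ))
}

if [[ -r /proc/meminfo ]]; then
  # MemTotal is reported in kB (i.e. KiB).
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  xmx=$(heap_gb "$total_kb" 85)
  echo "suggested: --java-options '-Xmx${xmx}G'"
fi
```

For example, heap_gb 125829120 85 prints 102, since 125829120 kB is 120 GiB. If -Xms is kept, it should be lowered to no more than the new -Xmx, because the JVM refuses to start when the initial heap is larger than the maximum.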
-
Hi Pamela Bretscher,
Thank you. Yes, it does look like the OOM killer kicked in and killed the job (based on the kernel log, /var/log/kern.log). I allocated ~85% of the total memory to the job, and it has been running since this morning without problems so far. I will let you know if I run into the same issue again.
Thanks,
Ramesh
Update: The job ran successfully. Thanks!
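No error appears in the GATK log in this situation because the OOM killer ends its victim with SIGKILL, which no process (including the JVM) can catch or log, so the evidence ends up in the kernel log instead. The sketch below searches for the usual OOM-killer messages. It assumes the Debian/Ubuntu location /var/log/kern.log (other distributions may use /var/log/messages or journalctl -k) and falls back to dmesg, which may require root; find_oom_kills is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: look for OOM-killer activity in the kernel log.
# Assumption: /var/log/kern.log (Debian/Ubuntu); other systems may log
# kernel messages elsewhere. find_oom_kills is a hypothetical helper.

# find_oom_kills FILE -> prints lines such as "... invoked oom-killer ..."
# or "Out of memory: Killed process ..." (case-insensitive match).
find_oom_kills() {
  grep -i -E 'out of memory|oom-kill' "$1"
}

log=/var/log/kern.log
if [[ -r "$log" ]]; then
  find_oom_kills "$log" || echo "No OOM-killer entries found in $log"
else
  # Fall back to the kernel ring buffer; reading it may require root.
  dmesg 2>/dev/null | grep -i -E 'out of memory|oom-kill' \
    || echo "No OOM-killer entries found via dmesg"
fi
```

A match that names the java process around the time the GATK log stops is strong evidence that memory, not GATK itself, ended the run.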
-
Hi Ramesh Ramasamy,
Thank you for letting me know; I'm glad to hear that it worked!
Kind regards,
Pamela