MarkDuplicatesSpark consumes an enormous amount of RAM
Dear Officer,
Thank you for providing MarkDuplicatesSpark.
I've been using it to process targeted sequencing data. It worked fine until the last run, in which I used it on a BAM file two to three times larger than usual.
I qsub the job to a compute node with 10 CPU cores and 100 GB RAM. Here is the command I ran:
Here is the error log screenshot:
According to the error log, it somehow consumes more than 400 GB of RAM during processing, while the input BAM file is only around 3.3 GB.
Please let me know your thoughts on how to solve this. Thanks!
-
Hi Yangyxt
You are probably facing this issue because you have not limited memory usage with the `-Xmx` argument or the number of worker threads (which may be causing the use of all the cores) with the `--spark-master` argument. Try setting these values, run again, and let us know how that works out for you.
For more info on arguments take a look at the tool docs here: https://gatk.broadinstitute.org/hc/en-us/articles/360040096212-MarkDuplicatesSpark
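For example, an invocation along these lines caps the Java heap and the number of local Spark threads (the file names are placeholders, and the 16G heap and 8 threads should be adapted to your node):
gatk --java-options "-Xmx16G" MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --spark-master 'local[8]'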
-
Dear Bhanu Gandham
Thanks for the reply. I took a look at the arguments on the documentation page.
I found that the value of --spark-master is a URL string. In my case, I run MarkDuplicatesSpark on our institutional server; we do not run the pipeline on the cloud. Before seeing your reply, I had also tried adding one more argument, shown below:
`
--conf 'spark.executor.cores=3'
`
I made multiple trials, adjusting the number from 1 to 5. All of them ended up over-using CPU rather than memory, so all the jobs were killed by the PBS system. (I allocated 12 vCPUs to the job.)
As for the `-Xmx` argument, I could not find it or any argument with a similar function. If you have a webpage explaining this argument, please paste the link here. Thanks!
-
Hi,
Did you figure out what the issue was and how to resolve it? I am having the exact same problem even after using the `-Xmx` option. I am not sure what else to try.
-
Hi,
Did you figure it out?
I have the same problem.
I am using --java-options "-Xmx80G"
-
Hi Udi L,
How much available memory do you have on your machine? Memory errors can also occur if you try to allocate too much of your available memory to the job using the -Xmx option.
Kind regards,
Pamela
-
Hi Pamela,
Thank you for your reply.
I have 125 GB of memory on the machine.
The BAM file is about 32 GB.
The error I get is:
=>> PBS: job killed: mem 461773976kb exceeded limit 83886080kb
It looks like the application takes 460 GB (!)
Best regards,
Udi
-
Hi Udi L and Niyomi House,
It looks like the memory allocation is being complicated by the fact that this is a Spark program, which requires that you specify the number of executors (i.e. the number of spark workers that are spawned) as well as the memory allocation per executor. Here is an example script for doing so:
gatk MarkDuplicatesSpark \
-I input.bam \
-O marked_duplicates.bam \
-M marked_dup_metrics.txt \
--conf 'spark.executor.instances=10' \
--conf 'spark.executor.cores=10' \
--conf 'spark.executor.memory=8G'

This example will launch 10 worker processes and use 11 cores (1 for each instance + 1 for the "driver" process). Each worker will have 8 GiB of working memory, plus "overhead" (by default 10% of working memory, or 0.8 GiB). The driver will by default have 1 GiB of working memory and 10% of that as overhead. So to run this Spark job successfully, you'd want to allocate at least
(8 GiB / executor) x (10 executors) x 1.1 + (1 GiB / driver) x (1 driver) x 1.1 = 89.1 GiB
and 11 processes on the cluster.
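On a PBS cluster, the corresponding resource request might look roughly like this (a sketch only: 92g is just a value comfortably above the ~89 GiB estimate, and directive names and units vary between sites):
#PBS -l nodes=1:ppn=11
#PBS -l mem=92g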
Of course, you may want to tune things to be more optimal to your specific workflow: you can alter the number of executors and memory per executor, but other parameters are available (e.g. the amount of overhead memory). You can view the full list of properties here:
https://spark.apache.org/docs/2.4.0/configuration.html#available-properties
When Spark resource usage is configured correctly, I think you shouldn't need to edit things like the Java -Xmx setting. Please let me know if this helps solve your problem.

Kind regards,
Pamela
-
Hi Pamela,
It took me some time to rerun.
I followed your instructions and ran with the Spark parameters.
First I ran without the 'Xmx' option, with 11 cores and 90 GB of memory, and I again got the error:
=>> PBS: job killed: mem 99968784kb exceeded limit 94371840kb
Then I ran again with '-Xmx90G', and now some of the samples work fine but others fail with the same error. Weird.
I checked whether the BAM sizes differ, but that doesn't seem to be the case.
Appreciate your advice
Best,
Udi
-
Hi Udi L,
Could you please send the command that you first tried to run without the 'Xmx' option. If you specified 90 GB of memory in the command I posted above, this would allocate 90 GB of memory to each of the 11 cores, which would exceed your total available memory. Could you try running with 11 cores and 8 or 9 GB of memory per executor?
Kind regards,
Pamela
-
Hi Pamela,
This is the command I used:
gatk MarkDuplicatesSpark -I {input} -O {output.bam} -OBI False \
--tmp-dir {temp} --conf 'spark.executor.cores=10' --conf 'spark.executor.instances=10' \
--conf 'spark.executor.memory=8G' --remove-sequencing-duplicates

When I wrote that I specified 90 GB, that was for the cluster.
I now see that I gave 10 cores to Spark (I gave 11 to the cluster); maybe that is the problem.
Best,
Udi
-
Hi Udi L,
Thank you for clarifying and providing the command. Did you try running it with the same number of cores for spark and the cluster to see if this is the issue?
-
Hi Pamela,
I tried to run with the same number of cores but unfortunately the issue is not resolved.
Udi
-
Hi Udi L,
I spoke with some other members of the GATK team, and the only other thing we can think of for you to try is to use even fewer executors to minimize the allocated resources. Could you try keeping the memory allocation the same but specifying only 6 executors? I have also created a GitHub ticket so that the GATK team can look into the memory allocation issues with MarkDuplicatesSpark, which you can follow the progress of here: https://github.com/broadinstitute/gatk/issues/7406
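For example, keeping your other arguments and paths as before (the file names here are placeholders), something like:
gatk MarkDuplicatesSpark -I input.bam -O marked_duplicates.bam \
    --conf 'spark.executor.instances=6' \
    --conf 'spark.executor.memory=8G'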
Kind regards,
Pamela
-
Hi Pamela,
Thank you for your effort.
I ran the program with your suggestions, this is the code:
gatk MarkDuplicatesSpark -I {input} -O {output} -OBI False \
--tmp-dir {tmp} --conf 'spark.executor.cores=6' --conf 'spark.executor.instances=5' \
--conf 'spark.executor.memory=8G' --remove-sequencing-duplicates

I allocated 90 GB and 11 cores to the cluster.
There was some improvement, in that some samples went through (taking about 12 hours), but more than half still raised the memory allocation error.
I think I will try another program to do the job.
Best,
Udi
-
Hi Udi L,
Thank you for trying my suggestion. I will keep you updated on the Github ticket and hopefully, the GATK team will find a solution to the memory issues soon.
Kind regards,
Pamela
-
Hi there,
Has there been any progress on this issue? I am running into a similar problem with MarkDuplicatesSpark requiring a lot of memory, and PBS will often kill my jobs, though only after the job has run for 30-40 minutes and hit that memory limit.
I am going through the pre-processing pipeline for exome sequencing reads.
I'm running GATK 4.2.2.0 on our HPC cluster.
This is my script:
#PBS -l nodes=1:ppn=11
#PBS -l gres=localhd:100
#PBS -l mem=100g
#PBS -l vmem=100g
#PBS -l walltime=72:00:00
#PBS -joe /hpf/projects/dirkslab/People/Vincent/Exome_Aln
module load gatk
gatk MarkDuplicatesSpark -I input.bam -O output.bam \
--tmp-dir path/to/temp \
--conf 'spark.executor.cores=5' --conf 'spark.executor.instances=1' --conf 'spark.executor.memory=5G' \
-M Metrics.txt

From the previous suggestions in this forum I should have enough memory allocated on the cluster to run this job, and some of my sequences are indeed able to run. When a job does run, the output file reads: Runtime.totalMemory()=21153447936. So only about 21 GB were needed.
On other files where it gets killed by PBS this is the error message:
Aborted by PBS Server
Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted See Administrator for help Exit_status=-10 resources_used.cput=05:09:22 resources_used.vmem=647590228kb resources_used.walltime=00:33:07 resources_used.mem=27104220kb resources_used.energy_used=0 req_information.task_count.0=1 req_information.lprocs.0=11 req_information.total_memory.0=157286400kb req_information.memory.0=157286400kb req_information.total_swap.0=157286400kb req_information.swap.0=157286400kb req_information.thread_usage_policy.0=allowthreads req_information.hostlist.0=node450:ppn=11

I have had to increase the allocated mem and vmem up to 450 GB to get jobs done; otherwise they'll run until they reach the limit and then be killed.
If I add --java-options "-Xmx500G" then I get the following error message:
Using GATK jar /hpf/tools/centos7/gatk/4.2.2.0/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_writ$

Error occurred during initialization of VM
Could not reserve enough space for 524288000KB object heap

Any tips or ideas would be helpful, as this is quite frustrating and difficult to figure out.
Thanks!
-
Hi Vincent Ye,
Unfortunately it does not look like we have had a chance to fix this issue yet. I would recommend that you comment on the github ticket with your example so that we can see multiple examples where it occurs.
Could you share the script details for when you are getting the heap size error message with -Xmx500G? We can look into possible solutions for that error message.
Best,
Genevieve
-
Hi Vincent Ye,
I have good news! I think we have found an issue with the documentation that is leading to some of these memory issues with MarkDuplicatesSpark.
Please see this section of our gatk README: https://github.com/broadinstitute/gatk#running-gatk4-spark-tools-locally. You'll want to specify the --spark-master argument to control how many threads Spark will use. You will also want to use the java option -Xmx to limit how much physical memory java can use. Please see the details about the Xmx arguments at these resources below:
- https://gatk.broadinstitute.org/hc/en-us/articles/360035532372-Java-is-using-too-many-resources-threads-memory-or-CPU-
- https://gatk.broadinstitute.org/hc/en-us/articles/360035531892-GATK4-command-line-syntax
In local mode, the conf arguments are ignored, so that is why they are not working for you currently.
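For example, something along these lines (the paths are placeholders; the -Xmx value should stay below the memory you request from your cluster, and local[N] should match the cores you request):
gatk --java-options "-Xmx80G" MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --tmp-dir /path/to/tmp \
    --spark-master 'local[8]'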
Could you try out these options and let us know if it works so that we can update our documentation if so?
Best,
Genevieve
-
Vincent Ye, did this work for you?
-
Hi Genevieve,
Unfortunately it did not.
Here's the script I ran:
gatk MarkDuplicatesSpark -I path/to/bam -O /path/to/marked.bam --tmp-dir /path/to/temp \
--conf 'spark.executor.cores=5' --conf 'spark.executor.instances=1' --conf 'spark.executor.memory=8G' --spark-runner LOCAL --spark-master 'local[4]' --java-options "-Xmx80G" \
-M /path/to/metrics

I assigned 100 GB of mem and vmem to this through my HPC cluster, but the job was killed by PBS:
PBS Job Id: 75265405
Job Name: MarkDuplicates3GF5R Exec host: node242/0-3 Aborted by PBS Server Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted See Administrator for help Exit_status=-10 resources_used.cput=00:42:16 resources_used.vmem=229724788kb resources_used.walltime=00:12:42 resources_used.mem=29296476kb resources_used.energy_used=0 req_information.task_count.0=1 req_information.lprocs.0=4 req_information.total_memory.0=104857600kb req_information.memory.0=104857600kb req_information.total_swap.0=104857600kb req_information.swap.0=104857600kb req_information.thread_usage_policy.0=allowthreads req_information.hostlist.0=node242:ppn=4 req_information.task_usage.0.task.0={"task":{"cpu_list":"0-3","mem_list":"0","cores":0,"threads":4,"host":"node242"}}
-
Thank you for the update Vincent Ye. What is the size of these bams? And could you also verify if they are queryname sorted?
-
Hi Genevieve,
This BAM file is 12 GB. And it should be sorted, as I followed the (How to) articles to take my FASTQ reads to a mapped BAM via these two tutorials:
-
Thanks for the update Vincent Ye. We aren't able to figure out any overt reason why you would be seeing this. We'll keep looking into it.
-
Hi Vincent Ye,
I wanted to update you that we are continuing to look into this issue. Unfortunately, it is really hard for us to troubleshoot since we do not see these same problems on our end. There is one other option you can try for now; let us know if it leads to any improvement:
--conf 'spark.kryo.referenceTracking=false'
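For example, added to the command you have been running (other arguments and paths as before; the file names here are placeholders):
gatk MarkDuplicatesSpark -I input.bam -O marked_duplicates.bam -M metrics.txt \
    --conf 'spark.kryo.referenceTracking=false'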
Please keep us updated on your end and we will let you know if we have any updates.
Best,
Genevieve