Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



MarkDuplicatesSpark consumes enormous amount of RAM

Answered

25 comments

  • Bhanu Gandham

    Hi Yangyxt,

    The reason you are probably facing this issue is that you have not limited memory usage with the `-Xmx` argument or capped the number of worker threads (which is likely why all the cores are being used) with the `--spark-master` argument. Try setting these values, run again, and let us know how that works out for you.

     

    For more info on arguments take a look at the tool docs here: https://gatk.broadinstitute.org/hc/en-us/articles/360040096212-MarkDuplicatesSpark
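    Note that -Xmx is not a MarkDuplicatesSpark argument; it is a Java heap option passed through the gatk wrapper's --java-options flag. A minimal sketch combining both suggestions (example values only, to be sized to your machine):

    gatk --java-options "-Xmx8G" MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt \
        --spark-master 'local[4]'

    Here -Xmx8G caps the Java heap at 8 GB and local[4] caps Spark at 4 local worker threads.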

  • Yangyxt

    Dear Bhanu Gandham,

    Thanks for the reply. I took a look at the arguments on the doc page.

    I found that the value of --spark-master is a URL string. In my case, I ran MarkDuplicatesSpark on our institutional server; we do not run the pipeline in the cloud. Before seeing your reply, I had also tried adding one more argument, shown below:

    --conf 'spark.executor.cores=3'

    I made multiple trials, adjusting the number from 1 to 5. All of them ended up exceeding the CPU limit rather than the memory limit, so all the jobs were killed by the PBS system. (I allocated 12 vCPUs to the job.)

    As for the `-Xmx` argument, I did not find it, or any argument with a similar function, in the tool docs. If you have a webpage explaining this argument, please paste the link here. Thanks!

     

     

  • Niyomi House

    Hi,

    Did you figure out what the issue was and how to resolve it? I am having the exact same problem even after using the "-Xmx" option. I am not sure what else to try.

  • Udi L

    Hi,

    Did you figure it out?

    I have the same problem.

    I am using --java-options "-Xmx80G"

  • Pamela Bretscher

    Hi Udi L,

    How much available memory do you have on your machine? Memory errors can also occur if you try to allocate too much of your available memory to the job using the -Xmx option.

    Kind regards,

    Pamela

  • Udi L

    Hi Pamela,

    Thank you for your reply.

    I have 125 GB of memory on the machine.

    The BAM file is about 32 GB.

    The error I get is:

    =>> PBS: job killed: mem 461773976kb exceeded limit 83886080kb

    It looks like the application takes 460 GB (!)

    Best regards,

    Udi

  • Pamela Bretscher

    Hi Udi L and Niyomi House,

    It looks like the memory allocation is being complicated by the fact that this is a Spark program, which requires that you specify the number of executors (i.e. the number of spark workers that are spawned) as well as the memory allocation per executor. Here is an example script for doing so:

    gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --conf 'spark.executor.instances=10' \
    --conf 'spark.executor.cores=10' \
    --conf 'spark.executor.memory=8G'

    This example will launch 10 worker processes and use 11 cores (1 for each instance + 1 for the "driver" process). Each worker will have 8 GiB of working memory, plus "overhead" (by default 10% of working memory, or 0.8GiB). The driver will by default have 1 GiB of working memory and 10% of that as overhead. So to run this spark job successfully, you'd want to allocate at least (8 GiB / executor) x (10 executors) x 1.1 + (1 GiB / driver) x (1 driver) x 1.1 = 89.1 GiB  and 11 processes on the cluster.
    Of course, you may want to tune things for your specific workflow: you can alter the number of executors and the memory per executor, and other parameters are available as well (e.g. the amount of overhead memory). You can view the full list of properties here:
    https://spark.apache.org/docs/2.4.0/configuration.html#available-properties
    When Spark resource usage is configured correctly, I think you shouldn't need to edit things like the java -Xmx setting. Please let me know if this helps solve your problem.
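    For instance, on a PBS cluster (a sketch only; directive syntax and sensible values vary by site, and this assumes the job is limited mainly by physical memory), the batch request matching that example could look like:

    #PBS -l nodes=1:ppn=11   # 10 executor processes + 1 driver
    #PBS -l mem=90g          # at least the ~89.1 GiB estimated above

    followed by the gatk MarkDuplicatesSpark command shown earlier.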

    Kind regards,

    Pamela

  • Udi L

    Hi Pamela,

    It took me some time to rerun.

    I followed your instructions and ran with the Spark parameters, etc.

    First I ran without the '-Xmx' option, with 11 cores and 90 GB of memory, and I again got the error:

    =>> PBS: job killed: mem 99968784kb exceeded limit 94371840kb

    Then I ran again with '-Xmx90G', and now some of the samples work fine but others fail with the same error. Weird.

    I checked whether it might be a difference in BAM size, but that doesn't seem to be the case.

    I'd appreciate your advice.

    Best,

    Udi 

     

  • Pamela Bretscher

    Hi Udi L,

    Could you please send the command that you first tried to run without the '-Xmx' option? If you specified 90 GB of memory in the command I posted above, this would allocate 90 GB of memory to each of the 11 cores, which would exceed your total available memory. Could you try running with 11 cores and 8 or 9 GB of memory?

    Kind regards,

    Pamela

  • Udi L

    Hi Pamela,

    This is the command I used:

     gatk MarkDuplicatesSpark -I {input} -O {output.bam} -OBI False \
    --tmp-dir {temp} --conf 'spark.executor.cores=10' --conf 'spark.executor.instances=10' \
    --conf 'spark.executor.memory=8G' --remove-sequencing-duplicates

     

    When I wrote that I specified 90 GB, that was for the cluster.

    I now see that I gave 10 cores to Spark (I gave 11 to the cluster); maybe that is the problem.

    Best,

    Udi

     

  • Pamela Bretscher

    Hi Udi L,

    Thank you for clarifying and providing the command. Did you try running it with the same number of cores for spark and the cluster to see if this is the issue?

  • Udi L

    Hi Pamela,

    I tried running with the same number of cores, but unfortunately the issue was not resolved.

    Udi

  • Pamela Bretscher

    Hi Udi L,

    I spoke with some other members of the GATK team, and the only other thing we can think of for you to try is to use even fewer executors, to minimize the allocated resources. Could you try keeping the memory allocation the same but specifying only 6 executors? I have also created a GitHub ticket so that the GATK team can look into the memory allocation issues with MarkDuplicatesSpark; you can follow its progress here: https://github.com/broadinstitute/gatk/issues/7406

    Kind regards,

    Pamela

  • Udi L

    Hi Pamela,

    Thank you for your effort.

    I ran the program with your suggestions; this is the command:

     gatk MarkDuplicatesSpark -I {input} -O {output}  -OBI False \
    --tmp-dir {tmp} --conf 'spark.executor.cores=6' --conf 'spark.executor.instances=5' \
    --conf 'spark.executor.memory=8G' --remove-sequencing-duplicates

    I allocated 90 GB and 11 cores on the cluster.

    There was some improvement, in that some samples went through (taking about 12 hours), but more than half still raised the memory allocation error.

    I think I will try another program to do the job.

    Best,

    Udi

  • Pamela Bretscher

    Hi Udi L,

    Thank you for trying my suggestion. I will keep you updated on the GitHub ticket, and hopefully the GATK team will find a solution to the memory issues soon.

    Kind regards,

    Pamela

  • Vincent Ye

    Hi there,

     

    Has there been any progress on this issue? I'm running into a similar problem with MarkDuplicatesSpark requiring a lot of memory, and PBS will often kill my jobs, but only after the job has run for 30-40 minutes and hit that memory limit.

    I am going through the pre-processing pipeline for exome sequencing reads.

    I'm running GATK 4.2.2.0 on our HPC cluster.

    This is my script:

    #PBS -l nodes=1:ppn=11
    #PBS -l gres=localhd:100
    #PBS -l mem=100g
    #PBS -l vmem=100g
    #PBS -l walltime=72:00:00
    #PBS -joe /hpf/projects/dirkslab/People/Vincent/Exome_Aln

    module load gatk

    gatk MarkDuplicatesSpark -I input.bam -O output.bam \
    --tmp-dir path/to/temp \
    --conf 'spark.executor.cores=5' --conf 'spark.executor.instances=1' --conf 'spark.executor.memory=5G' \
    -M Metrics.txt

    From the previous suggestions in this thread, I should have enough memory allocated on the cluster to run this job, and some of my sequences are indeed able to run. When it does run, the job output file reads: Runtime.totalMemory()=21153447936, so only about 21 GB were needed.

     

    On other files where it gets killed by PBS this is the error message:

    Aborted by PBS Server
    Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted
    See Administrator for help
    Exit_status=-10
    resources_used.cput=05:09:22
    resources_used.vmem=647590228kb
    resources_used.walltime=00:33:07
    resources_used.mem=27104220kb
    resources_used.energy_used=0
    req_information.task_count.0=1
    req_information.lprocs.0=11
    req_information.total_memory.0=157286400kb
    req_information.memory.0=157286400kb
    req_information.total_swap.0=157286400kb
    req_information.swap.0=157286400kb
    req_information.thread_usage_policy.0=allowthreads
    req_information.hostlist.0=node450:ppn=11

    I have had to increase the allocated mem and vmem up to 450 GB to get jobs done; otherwise they'll run until they reach the limit and then be killed.

     

    If I add --java-options "-Xmx500G" then I get the following error message:

    Using GATK jar /hpf/tools/centos7/gatk/4.2.2.0/gatk-package-4.2.2.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_writ$
    Error occurred during initialization of VM
    Could not reserve enough space for 524288000KB object heap

    Any tips and ideas would be helpful as this is quite frustrating and difficult to figure out.

     

    Thanks!

  • Genevieve Brandt (she/her)

    Hi Vincent Ye,

    Unfortunately, it does not look like we have had a chance to fix this issue yet. I would recommend that you comment on the GitHub ticket with your example so that we can see multiple examples where it occurs.

    Could you share the script details from when you get the heap size error message with -Xmx500G? We can look into possible solutions for that error message.

    Best,

    Genevieve

  • Genevieve Brandt (she/her)

    Hi Vincent Ye,

    I have good news! I think we have found an issue with the documentation that is leading to some of these memory issues with MarkDuplicatesSpark.

    Please see this section of our GATK README: https://github.com/broadinstitute/gatk#running-gatk4-spark-tools-locally. You'll want to specify the --spark-master argument to control how many threads Spark will use. You will also want to use the Java option -Xmx to limit how much physical memory Java can use. Please see the details about the -Xmx argument at the resources below:

    In local mode, the --conf executor arguments are ignored, which is why they are not working for you currently.
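    As a rough sketch of such a local-mode run (placeholder paths and example sizes only; size the heap to fit well inside your cluster allocation):

    # Threads come from local[N] and heap size from -Xmx; the
    # spark.executor.* --conf properties can be dropped in local mode.
    gatk --java-options "-Xmx32G" MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt \
        --tmp-dir /path/to/temp \
        --spark-master 'local[8]'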

    Could you try out these options and let us know if it works so that we can update our documentation if so?

    Best,

    Genevieve

  • Genevieve Brandt (she/her)

    Vincent Ye, did this work for you?

  • Vincent Ye

    Hi Genevieve,

     

    Unfortunately it did not.

    Here's the script I ran:

    gatk MarkDuplicatesSpark -I path/to/bam -O /path/to/marked.bam --tmp-dir /path/to/temp \
    --conf 'spark.executor.cores=5' --conf 'spark.executor.instances=1' --conf 'spark.executor.memory=8G' --spark-runner LOCAL --spark-master 'local[4]' --java-options "-Xmx80G" \
    -M /path/to/metrics

    I assigned 100 GB of mem and vmem to this through my HPC cluster, but the job was killed by PBS:

    PBS Job Id: 75265405
    Job Name:   MarkDuplicates3GF5R
    Exec host:  node242/0-3
    Aborted by PBS Server
    Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted
    See Administrator for help
    Exit_status=-10
    resources_used.cput=00:42:16
    resources_used.vmem=229724788kb
    resources_used.walltime=00:12:42
    resources_used.mem=29296476kb
    resources_used.energy_used=0
    req_information.task_count.0=1
    req_information.lprocs.0=4
    req_information.total_memory.0=104857600kb
    req_information.memory.0=104857600kb
    req_information.total_swap.0=104857600kb
    req_information.swap.0=104857600kb
    req_information.thread_usage_policy.0=allowthreads
    req_information.hostlist.0=node242:ppn=4
    req_information.task_usage.0.task.0={"task":{"cpu_list":"0-3","mem_list":"0","cores":0,"threads":4,"host":"node242"}}

  • Genevieve Brandt (she/her)

    Thank you for the update, Vincent Ye. What is the size of these BAMs? And could you also verify whether they are queryname-sorted?

  • Vincent Ye

    Hi Genevieve,

    This BAM file is 12 GB. And it should be sorted, as I followed the (How to) articles to take my FASTQ reads to a mapped BAM via these two tutorials:

    https://gatk.broadinstitute.org/hc/en-us/articles/4403687183515--How-to-Generate-an-unmapped-BAM-from-FASTQ-or-aligned-BAM

     

    https://gatk.broadinstitute.org/hc/en-us/articles/360039568932--How-to-Map-and-clean-up-short-read-sequence-data-efficiently

  • Genevieve Brandt (she/her)

    Thanks for the update Vincent Ye. We aren't able to figure out any overt reason why you would be seeing this. We'll keep looking into it.

  • Genevieve Brandt (she/her)

    Hi Vincent Ye,

    I wanted to update you that we are continuing to look into this issue. Unfortunately, it is really hard for us to troubleshoot since we do not see these same problems on our end. There is one other option you can try for now; please let us know if it leads to any improvement:

    --conf 'spark.kryo.referenceTracking=false'

    Please keep us updated on your end and we will let you know if we have any updates.

    Best,

    Genevieve
