Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDBImport running out of memory?

8 comments

  • Genevieve Brandt

    To help with runtime or memory usage, try the following:

    1. Verify this issue persists with the latest version of GATK.

    2. Specify a --tmp-dir that has room for all necessary temporary files.

    3. Set the Java memory limit with the -Xmx Java option.

    4. Run the tool through the gatk wrapper script on the command line.

    5. Check the depth of coverage of your sample at the area of interest.

    6. Check memory/disk space availability on your end.
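
Steps 2, 3, 4 and 6 of the checklist above can be combined into a single invocation. The following is a minimal sketch rather than an official recipe: every path (scratch/tmp, scratch/gdb, chromosomes.list, sample_map) and the 16g heap size are placeholder assumptions, and the script only prints the final command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch only: paths and heap size are placeholders, adjust for your system.
TMP_DIR="scratch/tmp"      # step 2: scratch space for temporary files
HEAP="16g"                 # step 3: Java heap limit passed through -Xmx
WORKSPACE="scratch/gdb"    # where the GenomicsDB workspace will be written

# Step 6: report free disk space (and memory, where available) before starting.
mkdir -p "$TMP_DIR"
df -h "$TMP_DIR"
# free(1) is Linux-specific, so it is skipped quietly where it is missing.
command -v free >/dev/null 2>&1 && free -g

# Step 4: go through the gatk wrapper script; -Xmx travels in --java-options.
CMD="gatk GenomicsDBImport --java-options '-Xmx${HEAP}'"
CMD="$CMD --genomicsdb-workspace-path ${WORKSPACE} -L chromosomes.list"
CMD="$CMD --tmp-dir ${TMP_DIR} --sample-name-map sample_map"

# Print instead of executing so the command can be inspected first.
echo "$CMD"
```

Once the printed command looks right, it can be pasted into a terminal or a cluster job script.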

  • Ivan

    Hi, thanks for the reply. I managed to get it to run by adding a Java option that runs garbage collection concurrently. Here's the command I'm running right now:

    gatk GenomicsDBImport --java-options '-Xmx1024g -XX:+UseConcMarkSweepGC' --genomicsdb-workspace-path scratch/gdb -L chromosomes.list --tmp-dir scratch/tmp --sample-name-map sample_map

    I am now having the opposite problem: GenomicsDBImport seems to be using only ~60 GB of memory despite the large maximum allocated. I am unable to check the number of threads it is using because of how my institution's computational cluster is set up, so it may not be using all the threads either. From what I have read, it seems that to parallelise the process I have to split the import to run on single intervals. Does this work for an entire chromosome?

  • Genevieve Brandt

    Hi Ivan, glad you were able to get it to run!

    If you are running on a cluster, use the --genomicsdb-shared-posixfs-optimizations option, which helps optimize GenomicsDB on shared POSIX filesystems. It was introduced in GATK 4.1.8.0.

    Yes, you can split the import to run on chromosomes separately. Please see this resource for more information: Intervals and interval lists
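
A per-chromosome import along these lines might look like the sketch below. The chromosome names (chr1 to chr3), the per-chromosome workspace naming, the paths and the 16g heap are assumptions for illustration; the flags are the ones named in this thread. The loop only prints each command, so the output can be reviewed or pasted into separate cluster jobs.

```shell
#!/bin/sh
# Sketch only: chromosome names, paths and heap size are illustrative assumptions.
CHROMS="chr1 chr2 chr3"
N=0

for chr in $CHROMS; do
    # One workspace per chromosome, each restricted to that chromosome with -L.
    # The shared-POSIX-filesystem flag needs GATK 4.1.8.0 or later.
    echo "gatk GenomicsDBImport --java-options '-Xmx16g'" \
         "--genomicsdb-workspace-path scratch/gdb_${chr} -L ${chr}" \
         "--tmp-dir scratch/tmp --sample-name-map sample_map" \
         "--genomicsdb-shared-posixfs-optimizations true"
    N=$((N + 1))
done

echo "# ${N} commands generated"
```

Because each command writes to its own workspace, the jobs are independent and can be submitted to the cluster side by side.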

  • Ivan

    Thanks a lot for the help. I have run the import with each chromosome separately. However I am still unclear on how to merge the resulting genomicsDBs together, and was only able to find https://github.com/broadinstitute/gatk/issues/6557, which I don't think I fully understand. Does this mean that I have to merge it manually as in the linked issue? Or is there a tool to do it within gatk?

  • Ivan

    I have also encountered the following error on the runs for most of the chromosomes (some finished with no errors):

    [TileDB::utils] Error: (gzip_handle_error) Cannot compress with GZIP: deflateInit error: Z_MEM_ERROR
    [TileDB::Codec] Error: Could not compress with .
    [TileDB::WriteState] Error: Cannot compress tile.
    09:49:57.788 erro NativeGenomicsDB - pid=79689 tid=80068 VariantStorageManagerException exception : Error while writing to TileDB array
    TileDB error message : [TileDB::WriteState] Error: Cannot compress tile
    terminate called after throwing an instance of 'std::exception'
    what(): std::exception

    I am running the import using the following command, importing one chromosome at a time:

    gatk GenomicsDBImport --java-options '-XX:ConcGCThreads=1 -Xmx16G -XX:ParallelGCThreads=1 -XX:ParallelCMSThreads=1 -XX:+UseConcMarkSweepGC' --genomicsdb-workspace-path "scratch/gdb${chr}" -L $chr --tmp-dir scratch/tmp --sample-name-map vcf_list --max-num-intervals-to-import-in-parallel 25 --verbosity DEBUG
  • Genevieve Brandt

    Hi Ivan,

    The error message "Cannot compress with GZIP: deflateInit error: Z_MEM_ERROR" is caused by a memory issue with the feature readers, since you are working with many samples. You are using the option --max-num-intervals-to-import-in-parallel 25, which should be decreased so that the jobs can run properly. Start with 1 and confirm that it works, then double it until you find a value that works well for your GenomicsDBImport runs.

    Another argument to consider is to specify the batch size which will control how many feature readers are open at once. The argument is --batch-size and more info can be found in the GenomicsDBImport docs. 

    The link you pointed to for merging the workspaces is not a supported method, though you may be able to get it to work. We do not provide GATK Support for that, however. Here is another related issue: https://github.com/broadinstitute/gatk/issues/6629
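
The tuning advice earlier in this reply (start at 1, then keep doubling) can be sketched as a small loop. Everything here is an assumption for illustration: the cap of 8, the batch size of 50, the paths, the per-attempt workspace names and the RUN dry-run switch are hypothetical and not recommended values. As written, the script only prints the commands; setting RUN to an empty string would execute them.

```shell
#!/bin/sh
# Sketch only: MAX_PARALLEL, BATCH_SIZE, paths and the dry-run switch are
# illustrative assumptions, not recommended values.
RUN="echo"          # dry run: prints each command; set RUN="" to execute
MAX_PARALLEL=8      # stop doubling once this value has been tried
BATCH_SIZE=50       # samples read per batch, which bounds the open feature readers
BEST=0              # largest setting that completed without error

p=1
while [ "$p" -le "$MAX_PARALLEL" ]; do
    # Each attempt writes to its own workspace so that runs do not collide.
    if $RUN gatk GenomicsDBImport --java-options "-Xmx16g" \
        --genomicsdb-workspace-path "scratch/gdb_test_p${p}" -L chromosomes.list \
        --tmp-dir scratch/tmp --sample-name-map sample_map \
        --batch-size "$BATCH_SIZE" \
        --max-num-intervals-to-import-in-parallel "$p"; then
        BEST=$p          # this setting completed; try twice as many next
        p=$((p * 2))
    else
        break            # first failure: keep the last working value
    fi
done

echo "largest working value: $BEST"
```

A full import per attempt is expensive, so it may be cheaper to tune on a single chromosome or a small interval list first.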

  • Ivan

    Thank you very much for the suggestions. They seem to have helped, and I have finally performed the imports successfully on each chromosome separately.

    Regarding merging the workspaces: from what I have gathered from various sources, there is no support for adding new samples to the workspaces. But what about merging workspaces with the same list of samples but different intervals?

  • Genevieve Brandt

    Ivan, glad you were able to get it to work! Yes, as the link you referenced earlier also says, merging workspaces with the same samples but different intervals might work, but it is not one of our supported methods in GATK.

    If you want the workspaces to be together, you could re-run GenomicsDBImport with the batch size argument and the other methods you have used to get it to import successfully. The best way to get all intervals in the same workspace would be to import them together.

    If other users have any other methods that have worked for them, please chime in here!
