GenomicsDBImport running out of memory?
Hi, I am trying to combine whole-genome VCFs from ~1,500 samples with GenomicsDBImport (GATK 4.1.8.0). The VCFs total about 87 GB, and I am importing the entire genome (chromosomes 1-22, X, Y, MT) without specifying individual intervals. However, I keep running into the following out-of-memory error:
Picked up _JAVA_OPTIONS: -Xmx1025M
10:03:21.490 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/apps/apps/binapps/gatk/4.1.8.0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 28, 2020 10:03:21 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
10:03:21.661 INFO GenomicsDBImport - ------------------------------------------------------------
10:03:21.661 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.0
10:03:21.661 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
10:03:21.661 INFO GenomicsDBImport - Executing as mbax2iw2@node786.pri.csf3.alces.network on Linux v3.10.0-693.17.1.el7.x86_64 amd64
10:03:21.661 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-b11
10:03:21.661 INFO GenomicsDBImport - Start Date/Time: 28 October 2020 10:03:21 GMT
10:03:21.661 INFO GenomicsDBImport - ------------------------------------------------------------
10:03:21.661 INFO GenomicsDBImport - ------------------------------------------------------------
10:03:21.662 INFO GenomicsDBImport - HTSJDK Version: 2.22.0
10:03:21.662 INFO GenomicsDBImport - Picard Version: 2.22.8
10:03:21.662 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
10:03:21.662 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:03:21.662 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:03:21.662 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:03:21.662 INFO GenomicsDBImport - Deflater: IntelDeflater
10:03:21.662 INFO GenomicsDBImport - Inflater: IntelInflater
10:03:21.662 INFO GenomicsDBImport - GCS max retries/reopens: 20
10:03:21.662 INFO GenomicsDBImport - Requester pays: disabled
10:03:21.662 INFO GenomicsDBImport - Initializing engine
10:03:22.108 INFO IntervalArgumentCollection - Processing 3088286401 bp from intervals
10:03:22.109 INFO GenomicsDBImport - Done initializing engine
10:03:22.517 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
10:03:22.523 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/iusers01/fatpou01/bmh01/mbax2iw2/scratch/gdb/vidmap.json
10:03:22.523 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/iusers01/fatpou01/bmh01/mbax2iw2/scratch/gdb/callset.json
10:03:22.523 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/iusers01/fatpou01/bmh01/mbax2iw2/scratch/gdb/vcfheader.vcf
10:03:22.523 INFO GenomicsDBImport - Importing to workspace - /mnt/iusers01/fatpou01/bmh01/mbax2iw2/scratch/gdb
10:03:22.523 INFO ProgressMeter - Starting traversal
10:03:22.523 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
10:18:25.392 INFO GenomicsDBImport - Shutting down engine
[28 October 2020 10:18:25 GMT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 15.07 minutes.
Runtime.totalMemory()=954728448
java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
at htsjdk.tribble.readers.TabixReader.readInt(TabixReader.java:189)
at htsjdk.tribble.readers.TabixReader.readIndex(TabixReader.java:264)
at htsjdk.tribble.readers.TabixReader.readIndex(TabixReader.java:287)
at htsjdk.tribble.readers.TabixReader.<init>(TabixReader.java:165)
at htsjdk.tribble.readers.TabixReader.<init>(TabixReader.java:129)
at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:80)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:117)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromPath(GenomicsDBImport.java:833)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReadersSerially(GenomicsDBImport.java:817)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.createSampleToReaderMap(GenomicsDBImport.java:659)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$$Lambda$76/692763171.apply(Unknown Source)
at org.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:699)
at org.genomicsdb.importer.GenomicsDBImporter$$Lambda$80/1890097328.get(Unknown Source)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 3 more
Using GATK jar /opt/apps/apps/binapps/gatk/4.1.8.0/gatk-package-4.1.8.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx1400g -XX:-UseGCOverheadLimit -jar /opt/apps/apps/binapps/gatk/4.1.8.0/gatk-package-4.1.8.0-local.jar GenomicsDBImport --genomicsdb-workspace-path scratch/gdb -L chromosomes.list --tmp-dir scratch/tmp --sample-name-map sample_map
Since the machine has far more memory than the combined size of the VCF files, I am rather stumped as to what is causing this issue. Any help would be appreciated.
-
To help with runtime or memory usage, try the following:
- Verify that this issue persists with the latest version of GATK.
- Specify a --tmp-dir that has room for all necessary temporary files.
- Specify Java memory usage using the Java option -Xmx (see the example after this list).
- Run the gatk command with the gatk wrapper script command line.
- Check the depth of coverage of your sample at the area of interest.
- Check memory/disk space availability on your end.
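For example, here is a minimal sketch using the gatk wrapper script with the paths from your command (the -Xmx value is illustrative, not a recommendation). Note that your log shows "Picked up _JAVA_OPTIONS: -Xmx1025M": options set in the _JAVA_OPTIONS environment variable are applied after the command line, so they would cap the heap at about 1 GB regardless of the -Xmx1400g you passed.
# hypothetical shell session: clear any environment-level heap cap, then run via the wrapper
unset _JAVA_OPTIONS
gatk --java-options "-Xmx64g" GenomicsDBImport \
    --genomicsdb-workspace-path scratch/gdb \
    -L chromosomes.list \
    --tmp-dir scratch/tmp \
    --sample-name-map sample_map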
-
Hi, thanks for the reply. I managed to get it to run by adding a Java option that enables the concurrent mark-sweep garbage collector. Here's the command I'm running right now:
gatk GenomicsDBImport --java-options '-Xmx1024g -XX:+UseConcMarkSweepGC' --genomicsdb-workspace-path scratch/gdb -L chromosomes.list --tmp-dir scratch/tmp --sample-name-map sample_map
I am now having the opposite problem: GenomicsDBImport seems to use only ~60 GB of memory despite the large maximum I allocated. I am unable to check the number of threads it is using because of how my institution's computational cluster is set up, so it may not be using all the available threads either. From what I have read, it seems that to parallelise the process I have to split the import into runs on single intervals. Does this work for an entire chromosome?
-
Hi Ivan, glad you were able to get it to run!
If you are running on a cluster, you will want to use the --genomicsdb-shared-posixfs-optimizations option, introduced in 4.1.8.0, to help GenomicsDB optimize for shared POSIX filesystems.
Yes, you can split the import to run on each chromosome separately (see the sketch below). Please see this resource for more information: Intervals and interval lists
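Here is a minimal sketch of a per-chromosome loop, assuming b37-style contig names (1-22, X, Y, MT; adjust to match your reference) and an illustrative -Xmx value; each run must write to its own fresh workspace path:
for chr in $(seq 1 22) X Y MT; do
    gatk GenomicsDBImport --java-options '-Xmx16g' \
        --genomicsdb-shared-posixfs-optimizations \
        --genomicsdb-workspace-path "scratch/gdb_${chr}" \
        -L "${chr}" \
        --tmp-dir scratch/tmp \
        --sample-name-map sample_map
done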
-
Thanks a lot for the help. I have run the import on each chromosome separately. However, I am still unclear on how to merge the resulting GenomicsDB workspaces together; the only relevant discussion I could find is https://github.com/broadinstitute/gatk/issues/6557, which I don't think I fully understand. Does this mean I have to merge them manually as in the linked issue, or is there a tool to do it within GATK?
-
I have also encountered the following error on the runs for most of the chromosomes (some finished with no errors):
[TileDB::utils] Error: (gzip_handle_error) Cannot compress with GZIP: deflateInit error: Z_MEM_ERROR
[TileDB::Codec] Error: Could not compress with .
[TileDB::WriteState] Error: Cannot compress tile.
09:49:57.788 erro NativeGenomicsDB - pid=79689 tid=80068 VariantStorageManagerException exception : Error while writing to TileDB array
TileDB error message : [TileDB::WriteState] Error: Cannot compress tile
terminate called after throwing an instance of 'std::exception'
what(): std::exception
I am running the import using the following command, importing one chromosome at a time:
gatk GenomicsDBImport --java-options '-XX:ConcGCThreads=1 -Xmx16G -XX:ParallelGCThreads=1 -XX:ParallelCMSThreads=1 -XX:+UseConcMarkSweepGC' --genomicsdb-workspace-path "scratch/gdb${chr}" -L $chr --tmp-dir scratch/tmp --sample-name-map vcf_list --max-num-intervals-to-import-in-parallel 25 --verbosity DEBUG
-
Hi Ivan,
The error message Cannot compress with GZIP: deflateInit error: Z_MEM_ERROR occurs because of a memory issue with the feature readers, since you are working with many samples. You are using --max-num-intervals-to-import-in-parallel 25, which should be decreased so that the jobs can run properly. You can start with 1, confirm that it works, then double it until you find a good setting to optimize your GenomicsDBImport runs.
Another argument to consider is --batch-size, which controls how many feature readers are open at once; more info can be found in the GenomicsDBImport docs. A sketch combining these options is below.
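For instance, a minimal sketch based on your command (the --batch-size of 50 is an illustrative starting point, not a verified recommendation):
gatk GenomicsDBImport --java-options '-Xmx16G -XX:+UseConcMarkSweepGC' \
    --genomicsdb-workspace-path "scratch/gdb${chr}" \
    -L "${chr}" \
    --tmp-dir scratch/tmp \
    --sample-name-map vcf_list \
    --max-num-intervals-to-import-in-parallel 1 \
    --batch-size 50 \
    --genomicsdb-shared-posixfs-optimizations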
The approach in the link you pointed to for merging the workspaces is not a supported method, though you may be able to get it to work; we do not provide GATK support for it. Here is another related issue: https://github.com/broadinstitute/gatk/issues/6629
-
Thank you very much for the suggestions; they seem to have helped, and I have finally performed the imports successfully on each chromosome separately.
Regarding merging the workspaces: from what I have gathered from the various sources, there is no support for adding new samples to an existing workspace, but what about merging workspaces with the same set of samples but different intervals?
-
Ivan, glad you were able to get it to work! Yes, as the link you referenced earlier also says: merging workspaces with the same samples but different intervals might work, but it is not one of our supported methods in GATK.
If you want all the intervals in one workspace, the best way is to re-run GenomicsDBImport importing them together, using the batch-size argument and the other settings that got your per-chromosome imports working; a sketch is below.
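As a minimal sketch of such a combined run, pulling together the options discussed in this thread (the -Xmx and --batch-size values are illustrative assumptions):
gatk GenomicsDBImport --java-options '-Xmx64g -XX:+UseConcMarkSweepGC' \
    --genomicsdb-workspace-path scratch/gdb_all \
    -L chromosomes.list \
    --tmp-dir scratch/tmp \
    --sample-name-map sample_map \
    --max-num-intervals-to-import-in-parallel 1 \
    --batch-size 50 \
    --genomicsdb-shared-posixfs-optimizations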
If other users have any other methods that have worked for them, please chime in here!