This document describes best practices and guidelines to follow if you have questions about GenomicsDB performance.
What checks should I run when GenomicsDB is too slow?
a. Potential resource bottlenecks for GenomicsDBImport include:
Filesystem latency: If you are using a shared filesystem to store your input gVCF files and/or write the output GenomicsDB workspace, you may encounter issues due to filesystem latency caused by frequent writes to the same files. In this case, we recommend setting
Too many open files and/or heavy memory usage: Importing a large number of samples and opening many file handles in parallel can increase memory usage. This can cause slowdowns and, in the worst case, crash the import. Setting --batch-size to something smaller - say
b. A large number of intervals can slow the import, especially if the intervals are small (1,000s or 10,000s of bases). In this case, the import tool is inefficient because the overhead of opening and closing the gVCF files for each interval is not amortized by the relatively small amount of variant data in the interval. It may help to use --merge-input-intervals to reduce the number of intervals. As of this writing, an upcoming GATK release will include a feature that substantially improves performance in the niche case where users have many small contigs (>100, although there are cases with over 500k). In this case, the --merge-input-intervals flag will not help, so we suggest using the --merge-contigs-into-num-partitions option and setting it to something less than 100.
c. To speed up GenomicsDB, try using the --bypass-feature-reader option. Starting with GATK 220.127.116.11, this option uses a different feature reader for GenomicsDBImport that can lead to a 10-15% increase in speed. It also uses less memory when VCFs and GenomicsDB workspaces are on local disks.
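The options discussed above can be combined in a single invocation. The following is a minimal Python sketch that only assembles the argument list and prints it; nothing is executed. The helper name build_import_command, the workspace name, the sample map, and the interval file are hypothetical placeholders, and the default batch size of 50 is simply the value quoted later in this document.

```python
import shlex


def build_import_command(workspace, sample_map, intervals, batch_size=50,
                         merge_input_intervals=False,
                         bypass_feature_reader=False):
    """Return a GenomicsDBImport argument list (hypothetical helper; runs nothing)."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "--sample-name-map", sample_map,
        "-L", intervals,
        "--batch-size", str(batch_size),
    ]
    # Boolean flags are only added when requested.
    if merge_input_intervals:
        cmd.append("--merge-input-intervals")
    if bypass_feature_reader:
        cmd.append("--bypass-feature-reader")
    return cmd


# Placeholder paths, chosen for illustration only.
cmd = build_import_command("my_workspace", "samples.map", "targets.interval_list",
                           merge_input_intervals=True,
                           bypass_feature_reader=True)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to review the flags before passing it to a job scheduler or to subprocess.run.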
How much physical memory should be allocated to GATK native libraries? What determines how much is needed?
a. A lot of the heavy lifting for the GenomicsDBImport process is done by underlying C/C++ libraries. If you set Java’s Xmx/Xms options to use up all the available physical memory, these C/C++ libraries will run out of memory and cause failures. While the exact memory usage will depend on the number of samples being imported, we would suggest setting the Java Xmx/Xms values to no more than 80% or 90% of the available physical memory.
b. If the --consolidate parameter is set to true, the C/C++ libraries may require even more memory than is suggested above (depending on the number of samples in the workspace).
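As a rough illustration of the 80% to 90% guideline above, the sketch below derives Xmx/Xms values from the machine's physical memory. The helper name java_heap_option and the 64 GB figure are assumptions made for the example; the right fraction depends on the number of samples and on whether --consolidate is used.

```python
def java_heap_option(total_physical_gb, heap_fraction=0.8):
    """Build a Java heap setting that leaves headroom for the native C/C++ libraries.

    Hypothetical helper: heap_fraction is the share of physical memory given to Java.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1 (exclusive)")
    # Round down to whole gigabytes, but never go below 1 GB.
    heap_gb = max(1, int(total_physical_gb * heap_fraction))
    return f"-Xmx{heap_gb}g -Xms{heap_gb}g"


# Example: a machine with 64 GB of physical memory (assumed value).
print(java_heap_option(64))
```

The resulting string can be passed to GATK through its --java-options argument, for example gatk --java-options "-Xmx51g -Xms51g" GenomicsDBImport followed by the usual import arguments.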
When GenomicsDB is slow, one suggestion is to reduce the batch size and the number of intervals (-L). However, what is an ideal batch size and what does it depend on? How should I decide on the number and sizes of my intervals? How does the performance of GenomicsDBImport vary with these settings?
a. We typically use a batch size of 50. Larger batches need more memory and file handles, but should run faster.
b. The sizes of the intervals do not matter as much as their sheer number. The --merge-input-intervals argument will do the “staging” once, but traverse all the data. In an exome case where HaplotypeCaller was run on the same intervals and there was no data outside those intervals, we found that it worked well. If you simply want to import many positions for GVCFs that have contiguous data between those positions, then it would probably be quite slow.
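To make the batch-size trade-off concrete, the sketch below computes how many batches an import would need for a given cohort. The helper name import_batches and the sample counts are hypothetical; the assumption is that samples are read batch by batch, so a smaller batch size means fewer gVCF readers open at once but more batches overall.

```python
import math


def import_batches(num_samples, batch_size=50):
    """Number of batches needed to import num_samples (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_samples / batch_size)


# Compare a few batch sizes for an assumed cohort of 1,000 samples.
for size in (25, 50, 100):
    print(size, import_batches(1000, size))
```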
What should I do when my data has too many contigs? How many are too many contigs for GenomicsDB?
a. There is no “fall off the cliff” threshold, but as the number of contigs starts to approach/exceed 100, you may start to see issues.
b. An upcoming GATK release will have a --merge-contigs-into-num-partitions option to merge contigs “under the covers”. This is really intended for the case where you might have many fairly small contigs. If you have 24 large contigs, and then 100 small ones that you want to import, you can do one of two things:
“Scatter” the import process so that different contigs/intervals are imported to different workspaces. There doesn’t necessarily need to be one interval per workspace, but this is one way to parallelize the import process. In this case, you could choose to use the --merge-contigs-into-num-partitions parameter only for the workspace that has the many small contigs. The parameter can be set to 1 if the many small contigs contain little data. If the total amount of data in the small contigs can be roughly quantified as some multiplier of the average amount of data in the other intervals, then set it to that multiplier. For instance, if the 100 contigs have approximately 5x the amount of data as the average interval from the 24 large contigs, then set
If you do not want to scatter-gather your import and query, then you should set --merge-contigs-into-num-partitions to at least 25. This should ensure that all the large contigs are imported separately (i.e., they are not merged), while the small contigs are merged. In this case, there is no way to specify particular intervals within the large contigs, so all of the data in the large contigs will be imported.
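The two rules of thumb above can be written down as small helpers. This is only a sketch of one reading of the guidance: the function names are hypothetical, and "number of large contigs plus one" is an interpretation of the "at least 25" advice for 24 large contigs, not a documented formula.

```python
def partitions_without_scatter(num_large_contigs):
    """Lower bound when everything goes into one workspace (hypothetical helper).

    One partition per large contig, plus at least one for the merged small contigs.
    """
    return num_large_contigs + 1


def partitions_for_small_contig_workspace(data_multiplier):
    """Partitions for a scattered workspace holding only the small contigs.

    data_multiplier is the rough ratio of data in the small contigs to the
    average data per interval elsewhere; the result is never below 1.
    """
    return max(1, round(data_multiplier))


print(partitions_without_scatter(24))              # the 24-large-contig case from the text
print(partitions_for_small_contig_workspace(0.2))  # small contigs with little data
```

The resulting value would be passed to --merge-contigs-into-num-partitions once that option is available in your GATK release.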
What should I do when I have too many intervals? How many are too many intervals for GenomicsDB?
a. There is no fixed number of intervals that is "too many", but anywhere from hundreds to thousands might be problematic. We typically work with exomes containing ~70,000 intervals, which have to be run with merged intervals as above.
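Before deciding whether to merge intervals, it can help to count them. The sketch below summarizes Picard-style interval_list lines (tab-separated contig, start, end, with 1-based inclusive coordinates and header lines starting with "@"). The helper name summarize_intervals and the sample lines are made up for illustration.

```python
def summarize_intervals(lines):
    """Return (interval_count, total_bases) for interval_list-style lines."""
    count = 0
    total_bases = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and "@" header lines.
        if not line or line.startswith("@"):
            continue
        contig, start, end = line.split("\t")[:3]
        count += 1
        # Coordinates are treated as 1-based and inclusive.
        total_bases += int(end) - int(start) + 1
    return count, total_bases


# Made-up example lines; in practice, pass an open file handle instead.
example = [
    "@HD\tVN:1.6",
    "chr1\t100\t199\t+\ttarget_1",
    "chr1\t1000\t1999\t+\ttarget_2",
]
count, total_bases = summarize_intervals(example)
print(count, total_bases)
```

A large count combined with a small average interval size is the situation in which merging intervals is most likely to help.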
How much space should be allocated to the tmp dir?
a. We usually use 50GB. However, for an exceptionally large shard we have used 100GB in the past.
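A quick pre-flight check can confirm that the temporary directory has enough free space before a long import starts. This is a minimal sketch using Python's standard library; the helper name has_enough_tmp_space is hypothetical, and the 50 GB default simply mirrors the figure above.

```python
import shutil
import tempfile


def has_enough_tmp_space(tmp_dir, required_gb=50):
    """Return True if tmp_dir has at least required_gb gigabytes free."""
    free_bytes = shutil.disk_usage(tmp_dir).free
    return free_bytes >= required_gb * 1024 ** 3


# Check the system default temp directory; substitute the path you pass to --tmp-dir.
print(has_enough_tmp_space(tempfile.gettempdir(), required_gb=50))
```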