Speeding up GenotypeGVCFs? GATK4
I'm currently in the process of creating my own dataset of known variants, since none generated from WGS exists for my study species (bumblebee). However, the joint genotyping step with GenotypeGVCFs is taking a really long time (16 days!) and I would like to speed it up. I have read in this forum about multithreading, or about parallelising the job by running one chromosome at a time, but I don't know how to write that code.
My current code is:
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R /proj/snic2020-16-43/ref/Bombus_terrestris.Bter_1.0.dna.toplevel.fa \
-V gendb://WorkspaceDBImport \
-O GenotypeGVCFs_output.vcf.gz
I would appreciate any help I can get with running this job faster.
Thanks!
-
Cecilia Kardum Hjort GenotypeGVCFs can run slowly because the GenomicsDB has to be loaded into memory; if it takes up too much RAM, things slow down drastically. You will probably get things to run faster just by serializing over the different chromosomes. To do this, use the -L <chromosome> argument for GenomicsDBImport and GenotypeGVCFs. That is (in bash):
for n in {1..19}; do
    # Import and genotype one chromosome at a time to keep the working set small
    gatk GenomicsDBImport -L $n <rest of command as before, with a per-chromosome workspace>
    gatk GenotypeGVCFs -L $n <rest of command as before, with a per-chromosome output>
done
Of course, you could also parallelize by farming out the different jobs on a cluster, or on Terra (https://app.terra.bio/#workspaces/help-gatk/GATK4-Germline-Preprocessing-VariantCalling-JointCalling/workflows/broad-firecloud-dsde/1-4-Joint-Genotyping-HG38)
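For instance, here is a minimal sketch of farming the per-chromosome GenotypeGVCFs jobs out as a SLURM array (the scheduler, the per-chromosome workspace names, and the memory values are all assumptions; the -L values must match the contig names in your reference):
#!/bin/bash
#SBATCH --array=1-19        # one array task per chromosome
#SBATCH --mem=32G           # per-task memory; tune for your cohort size
# Each task genotypes a single chromosome from its own GenomicsDB workspace
gatk --java-options "-Xmx28g" GenotypeGVCFs \
    -R /proj/snic2020-16-43/ref/Bombus_terrestris.Bter_1.0.dna.toplevel.fa \
    -V gendb://WorkspaceDBImport_chr${SLURM_ARRAY_TASK_ID} \
    -L ${SLURM_ARRAY_TASK_ID} \
    -O GenotypeGVCFs_chr${SLURM_ARRAY_TASK_ID}.vcf.gz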
-
David Benjamin Thanks for your reply! Could I run only GenotypeGVCFs with the -L option, since the GenomicsDB has already been created?
-
Unfortunately, I think the answer is no. The problem is that `-L` reduces CPU cost by running over only the selected intervals, but it doesn't address the memory issue of loading the whole `GenomicsDB`. (Someone who is more of an expert may chime in and tell me I'm wrong and that `-L` actually results in only a subset of the `GenomicsDB` being put into RAM, but I don't believe the DB is that smart about `-L`).
By the way, the reason memory issues can cause runtime problems is that the system starts to spend a lot of time just finding available spots in memory as RAM becomes saturated. In extreme cases the computer spends inordinate amounts of time moving data back and forth from RAM to hard disk trying to use the disk as virtual memory. This is called thrashing.
If breaking up both tasks into chromosome chunks doesn't help let us know. It's possible that the bumblebee genome has an edge case we haven't encountered before.
-
Based on some runs with GATK 4.2.0.0, it does seem to use a lot of memory no matter the size of the interval, and the usage correlates with the size of the chromosome: when the GenomicsDBs are split by chr1-22, memory use increases as chromosome size increases. Running GenotypeGVCFs on an interval of less than 1 Mb still uses 60-120 GB, depending on the chromosome, for 4000 samples. Instead of being able to run 16 jobs in parallel on a 196 GB compute node, our pipeline can only run 2 or 3 because of the memory requirements. Why can't GenotypeGVCFs minimize memory by reading in only what is needed to joint-genotype the region?
-
Hi jfarrell,
There are some previous forum posts that discuss ways to work around the memory allocation that is necessary for reading a GenomicsDB into GenotypeGVCFs such as this one: https://gatk.broadinstitute.org/hc/en-us/community/posts/360074780432-Speeding-up-GenotypeGVCFs-from-genomicsdb
Additionally, this document contains some helpful tips for speeding up GenomicsDB which also apply to running GenotypeGVCFs from a GenomicsDB workspace: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines
I hope you find this information helpful.
Kind regards,
Pamela
-
These are useful suggestions on speeding up GenotypeGVCFs, which I am using. None of them, however, address the large amount of memory used no matter how small the specified target is. GenotypeGVCFs spends a lot of time on I/O and uses a lot of memory during initialization, before it even starts calling variants. That large memory requirement limits how far the jobs can be parallelized on HPC clusters. What is going on during GenotypeGVCFs initialization that requires so much I/O and memory?
-
Hi jfarrell,
The size of the intervals specified does not matter as much as the number of intervals you specify when determining how much memory is used. The total amount of data you are inputting to GenotypeGVCFs is likely what drives the large amount of memory needed to initialize the tool. Are you still able to get it to run successfully despite the memory requirements?
Kind regards,
Pamela
-
It runs to completion if I give the job a lot of memory (80-120 GB) and target a small interval of the GenomicsDB, with 2000 samples for each chromosome.
The output vcf.gz is only 55 MB for chr3-5882072-5975713.vcf.gz; unzipped it is only about 350 MB:
-rw-r--r-- 1 farrell kageproj 336M Jul 22 16:42 unzipped.vcf
The memory used is way out of proportion to the actual size of the VCF being generated. Why is so much memory being used? GenotypeGVCFs is extremely memory inefficient, and this limits the number of parallel jobs that can be run.
Are there any benchmarks describing the memory/CPU requirements for various cohort sizes when running Dragen/GATK?
-
Hi jfarrell,
There are several factors beyond the size of the output file that contribute to the memory required. For example, it takes more memory to genotype sites with many alternate alleles or large genotype counts. The memory allocation, therefore, has more to do with the data the tool needs to read and analyze than with the information it outputs. The GenotypeGVCFs tool documentation describes some optional arguments that may help reduce memory usage (--max-genotype-count, --max-alternate-alleles, --disable-bam-index-caching).
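For illustration only (the flag values here are examples rather than tuned recommendations, and the workspace name is a placeholder), those arguments slot into the command like so:
gatk --java-options "-Xmx60g" GenotypeGVCFs \
    -R reference.fa \
    -V gendb://chr3_workspace \
    -L chr3:5882072-5975713 \
    --max-alternate-alleles 3 \
    --max-genotype-count 1024 \
    --disable-bam-index-caching true \
    -O chr3-5882072-5975713.vcf.gz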
I'm not currently aware of a document that lists the memory requirements for different cohort sizes; I would imagine that this depends on many other factors, including the machine you are running on. I'm sorry that I don't have more useful information for you, but I'm glad that you were still able to get the job to run to completion.
Kind regards,
Pamela
-
Sorry for bumping an old thread, but I am facing similar problems with the memory requirements and inability to scatter across multiple smaller intervals due to server hardware limitations.
I was going to recommend some method of caching the input GVCF files in memory, akin to the STAR aligner's LoadAndExit option, where a copy of the reference genome is cached in memory to make subsequent alignments quicker. However, since it is other calculations that are taking up the memory, I'm not sure how useful this approach would be.
-
Thanks for the post kvn95ss. We don't think this caching would improve the memory problems, because at this point we don't even load the GVCFs, just the bits we are working with. We already do something like this with the reference.
More caching probably wouldn't help memory anyway; it might make the memory problems worse, though it may help performance. Are your main issues with performance or with memory?
-
This is a potential solution that I will be testing out with my next batch of CRAMs. Version 4.2.6.1 adds support in CombineGVCFs and GenotypeGVCFs for "reblocked" GVCFs, as produced by the ReblockGVCF tool. Reblocked GVCFs have a significantly reduced storage footprint, and I expect the reduced size will help with both the speed and the memory requirements of joint genotyping. For one sample's chr1 GVCF, the g.vcf.gz dropped from 827 MB to 134 MB after reblocking. Hopefully that smaller file size will translate into less memory, I/O, and compute time for the GenotypeGVCFs step.
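For anyone following along, the reblocking step is a single command per GVCF (a minimal sketch; the file names are placeholders for your own):
gatk ReblockGVCF \
    -R reference.fa \
    -V sample.chr1.g.vcf.gz \
    -O sample.chr1.reblocked.g.vcf.gz
-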
Good point jfarrell! Yes, this can really help.
-
jfarrell and Genevieve Brandt (she/her) Using normal GVCFs, creating the genomics DB with GenomicsDBImport and then running GenotypeGVCFs used to take ~24 hours (around 1.9k samples, split across each chromosome, with chr1 and chr2 further split in half).
With the same setup, but using reblocked GVCFs and --bypass-feature-reader, the time went down to ~12 hours! That's a 50% reduction in time!
But as Kylo Ren said in Star Wars: The Last Jedi - "More... MORE!!"
-
kvn95ss You may see some extra speed-up if you break up all your intervals one more time, so chromosomes 1 and 2 would be split four ways and all the other chromosomes would be split in half. We typically run whole genomes 50-way parallel; see the sketch below.
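As an illustration of splitting one chromosome with -L (the coordinates and workspace name are placeholders; use your own contig lengths), two scattered jobs reading the same workspace might look like:
# First half of chr1, run in the background
gatk GenotypeGVCFs -V gendb://chr1_workspace -L chr1:1-120000000 <rest of command> &
# Second half of chr1, run concurrently
gatk GenotypeGVCFs -V gendb://chr1_workspace -L chr1:120000001-240000000 <rest of command> &
wait    # block until both scatter jobs finish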
-
kvn95ss and Genevieve Brandt (she/her)
Regarding the bypass-feature-reader recommendation, the documentation mentions that the VCFs must be normalized. Are reblocked GVCFs normalized and ready to use, or is there another step involved?
From the documentation: --bypass-feature-reader (default: false). Use htslib to read input VCFs instead of GATK's FeatureReader. This will reduce memory usage and potentially speed up the import. Lower memory requirements may also enable parallelism through max-num-intervals-to-import-in-parallel. To enable this option, VCFs must be normalized, block-compressed and indexed.
-
jfarrell the option --bypass-feature-reader needs the VCF to have no duplicate records. This means the same site is not found in multiple lines of the VCF. The default GATK pipeline does not produce this output, so if you are creating your VCF with GATK you should be fine. If you merge VCFs naively or use a different variant calling software, you might get duplicate records, which does not work with --bypass-feature-reader.
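For example, enabling it during import might look like this (a sketch; the workspace path and sample map file are placeholders for your own):
gatk GenomicsDBImport \
    --genomicsdb-workspace-path chr1_workspace \
    --sample-name-map cohort.sample_map \
    -L chr1 \
    --bypass-feature-reader true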
Thanks for writing in to clarify!