GenomicsDBImport very slow on genome with many contigs
GATK version
GATK 4.1.4.0
Command Example
gatk-4.1.4/gatk-4.1.4.0/gatk GenomicsDBImport --java-options "-Xmx50g -Xms50g -Djava.io.tmpdir=/data/work/tmp" --sample-name-map samples.block0.map -L block0.bed --batch-size 68 --tmp-dir=/data/work/tmp --genomicsdb-workspace-path cohort.block0
The problem
I'm running GenomicsDBImport on 68 samples against a ~7.5 Gb genome consisting of 442,020 contigs, which I have split into 100 bed files (each containing about 4,420 contigs). These 100 jobs have been running for over 10 days at this point. I have used GenomicsDBImport before on other large genomes (4.5 Gb, for instance) that consisted of full chromosomes instead of small contigs, and it finished much faster (hours rather than weeks).
I have also tried --merge-input-intervals true, but that did not seem to speed things up significantly. This might have to do with my intervals spanning entire contigs (and thus effectively the whole genome), whereas an exome covers only a fraction of the genome.
Could anyone explain why this process is so much slower when there are more contigs, and is there anything I can do about it? Right now I'm testing GATK 4.1.6 to see if that resolves anything.
Thanks in advance!
-
Hi,
- How much physical memory do you have in addition to 50G?
- Is the data remote? Can you please share contents of '--sample-name-map samples.block0.map' file?
- You have few enough samples that you could try CombineGVCFs instead of GenomicsDBImport and see if that works (a command sketch follows below this list).
- Last resort would be to reduce the number of contigs to a few hundred and try GenomicsDBImport again
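A rough sketch of what such a CombineGVCFs run could look like, with the reference and GVCF names as placeholders and one -V per sample:

gatk CombineGVCFs --java-options "-Xmx50g" \
    -R reference.fasta \
    -L block0.bed \
    -V sample1.g.block0.vcf.gz \
    -V sample2.g.block0.vcf.gz \
    -O cohort.block0.g.vcf.gz

The combined GVCF can then be passed to GenotypeGVCFs in place of a GenomicsDB workspace.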
-
Hi Bhanu,
Thanks for the reply!
1. I'm working on an HPC server with at least 150 GB per node, and I'm working with ~60 nodes.
2. No, the data is local (working on an SSD).
head samples.block0.map:
101 101.g.block0.vcf.gz
102 102.g.block0.vcf.gz
103 103.g.block0.vcf.gz
104 104.g.block0.vcf.gz
105 105.g.block0.vcf.gz
106 106.g.block0.vcf.gz
107 107.g.block0.vcf.gz
108 108.g.block0.vcf.gz
109 109.g.block0.vcf.gz
3. I have tried that, and it didn't seem faster (tried it on a smaller dataset).
4. So right now I'm running GenomicsDBImport on an SSD, and I found that it was writing a lot of data to the tmp dir. The GenomicsDBImport steps were at about 1/4 of the progress and the jobs had already used about 7 TB of space (tmp and output dirs). The SSD is only 20 TB, so there is no way this would have fit. Is the data produced in the tmp dir removed when the GenomicsDBImport step is finished?
Another workaround would be to just paste the contigs together (with a 1000 N gap between them). Right now I'm testing GATK 4.1.6 to see if that makes a difference.
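In rough terms, the pasting could be scripted with an awk sketch like the one below; the group size, gap length, and file names are assumptions to adjust, and it also writes an offsets table so variant positions can later be mapped back to the original contigs:

# paste contigs into super scaffolds separated by 1000 Ns and record where each contig starts
# (per_scaffold ~9000 gives roughly 50 scaffolds for ~442,020 contigs; each scaffold comes out
#  as one long sequence line, so re-wrap the FASTA afterwards if a downstream tool objects)
awk -v per_scaffold=9000 '
    BEGIN { gap = sprintf("%1000s", ""); gsub(/ /, "N", gap) }
    /^>/  { if (name != "") flush(); name = substr($1, 2); seq = ""; next }
          { seq = seq $0 }
    END   { if (name != "") flush(); if (pos > 0) print "" }
    function flush() {
        if (count % per_scaffold == 0) {           # start a new super scaffold
            if (count > 0) print ""
            scaffold = "super_" (count / per_scaffold + 1)
            print ">" scaffold
            pos = 0
        }
        printf "%s%s", seq, gap
        print name "\t" scaffold "\t" (pos + 1) "\t" length(seq) > "offsets.tsv"
        pos += length(seq) + 1000
        count++
    }' contigs.fa > concatenated.fa

For each contig, offsets.tsv then holds the super scaffold it landed in, its 1-based start position there, and its length, which is what a later translation step needs.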
-
Hi Marley Yeong
- I am very interested in how the workaround (pasting the contigs together) works out for you. Please share your experience after you have tested it. I wonder whether the coordinates might get messed up by stitching contigs together with 1000 N gaps.
- The tmp dir should be cleaned up on exit, yes.
General Info: GenomicsDBImport uses temporary disk storage during import. So the size of the tmp dir will depend on the size of vcfs imported. The amount of temporary disk storage required can exceed the space available, especially when specifying a large number of intervals. The command line argument `--tmp-dir` can be used to specify an alternate temporary storage location with sufficient space.
-
Hi Bhanu,
Thanks again for your reply!
I had the time to make a "concatenated" version of the genome, where ~400,000 contigs were pasted together into 50 groups/chromosomes.
I tested this with a very small dataset consisting of 3 samples with about 3k reads:
Running on 1/50 chromosomes/blocks of the concatenated genome:
HaplotypeCaller ran in 4 minutes.
GenomicsDBImport ran in 6 seconds.
GenotypeGVCFs ran in 6 seconds.
Running on 1/200 interval files (consisting of 3,620 intervals):
HaplotypeCaller ran in 1 minute.
GenomicsDBImport ran in 10 hours.
GenotypeGVCFs ran in 6 hours.
As you can see, concatenating the contigs into 50 larger blocks/chromosomes with 1000 N gaps radically decreases the runtime. Of course I didn't run this on a full dataset, and I imagine the difference will be somewhat smaller on full datasets because there is relatively less overhead.
I will continue the analysis with the concatenated genome; of course there will be an extra step required to translate the variant locations back to the original contigs. But at least this provides a workable workaround.
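Purely as an illustration (not our actual pipeline code), such a translation could look roughly like the awk below, assuming a plain-text VCF and an offsets table like the one from the earlier sketch (contig, super scaffold, 1-based start, length):

# map CHROM/POS from super-scaffold coordinates back to the original contigs
# (the ##contig header lines would also need rewriting, which is skipped here)
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { contig[NR] = $1; scaf[NR] = $2; start[NR] = $3; len[NR] = $4; n = NR; next }
     /^#/ { print; next }
     {
         for (i = 1; i <= n; i++)                  # linear scan: fine for a sketch, use an index for real data
             if ($1 == scaf[i] && $2 >= start[i] && $2 < start[i] + len[i]) {
                 $1 = contig[i]; $2 = $2 - start[i] + 1; break
             }
         print
     }' offsets.tsv genotyped.vcf > genotyped.contigs.vcf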
I will keep you updated on the runtime with the full dataset. If this is truly an inefficiency of GATK I hope you can improve this in future versions! Although I can imagine not a lot of people will have genomes that consist of nearly half a million contigs.
-
Yes please keep me posted on the runtime with the full dataset. I will discuss your findings with the dev team and get back to you.
-
Dear Marley and GATK dev team,
My observations: I am experiencing similar problems using the most recent GATK version (4.1.7.0). My goal is to combine 1,920 GVCFs (output of GATK HaplotypeCaller on bwa BAM files) from onion exome sequencing material against ~17,000 intervals of an onion assembly genome (13 Gb, consisting of ~80,000 contigs). In an earlier test with only 34 GVCFs (~2 MB per file), this analysis took more or less 50 hours using 100 GB RAM with 32 threads (500 GB RAM available per node).
Command used:
gatk GenomicsDBImport --java-options '-Xmx100g' --genomicsdb-workspace-path mypath.my_database --sample-name-map sample_map.txt -L snps.bed --merge-input-intervals true --reader-threads 32 --create-output-variant-index false --max-num-intervals-to-import-in-parallel 20 --ip 5
My problem: What I experience with the bigger sample set (N = 1920) is that Java gives memory errors (which makes sense...), even when raising the memory to 200 GB. I am performing the analysis in batches now, but I am really curious whether I am being efficient and getting the most out of the tool. I have also heard from colleagues that with chromosome-level reference genomes the analysis only took a few hours.
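For reference, batching can also be expressed directly in the tool with --batch-size, which limits how many sample readers are held open at once; a rough sketch based on the command above, where the batch size and parallel-interval count are only guesses to tune:

# smaller --batch-size and fewer intervals imported in parallel both lower peak memory
gatk GenomicsDBImport --java-options '-Xmx100g' \
    --genomicsdb-workspace-path mypath.my_database \
    --sample-name-map sample_map.txt \
    -L snps.bed --merge-input-intervals true --ip 5 \
    --batch-size 50 \
    --max-num-intervals-to-import-in-parallel 4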
My Question: I am considering your workaround Marley, so I was wondering whether you can recommend this or whether you found other possibilities? Also, dev team, I am curious about your discussions on this topic.
Thanks a lot in advance,
Annabel Dekker
-
Hi Annabel,
Believe it or not, the run on the full dataset still hasn't completed. This time it is not the fault of GenomicsDBImport though. Using GATK 4.1.6 we ran into a bug, so the whole pipeline had to be run again.
But I would definitely recommend my approach for GenomicsDBImport: just concatenate all the contigs into super scaffolds (with some gap between the contigs) that are roughly the size of a normal chromosome and you should be fine.
You could "deconvolute" the variants back to the original contigs after genotyping, at least that's how we do it. Unfortunately I am not allowed to share that code with anyone outside the company.
Cheers,
Marley
-
Hi Marley,
Thanks for your quick reply! How long do you expect your GenomicsDBImport run will take now that the other bugs are solved? I will really consider taking your advice if the GATK developers have not found other solutions yet. Let's keep each other in the loop, as I expect to run into this more often in the future.
Good luck,
Annabel
-
Thank you for your response and contribution to building the GATK knowledge base in this forum, Marley Yeong!
annabeldekker we haven't found a solution for this yet. When you do try this workaround, please post your findings/thoughts here so other users can benefit from them too.
-
Hi Annabel,
GenomicsDBImport seems to run about as fast as GenotypeGVCFs. I can't really give you a time indication yet, but from different projects it at least seems to be a lot faster than HaplotypeCaller!
Cheers!
Marley
-
Hi all,
The analysis completed, and GenomicsDBImport ran as fast as I was expecting. I concatenated the 7.5 Gb genome with ~0.5 million contigs into 150 contigs, and within 2 days all jobs were finished (running GenomicsDBImport per contig).
Cheers,
Marley
-
Thanks for sharing Marley!
-
Thanks Marley,
I eventually decided to scale down on complexity by selecting only certain regions. Together with that I used your workaround of merging contigs with 5,000 N gaps. It sped up the process to 7 hours for the whole pipeline (still left with 1,200 contigs, but a new reference of only 33 Mb instead of 13 Gb). So thank you very much for your ideas! If I decide to run a test on the original genome with the merged-contig workaround, I will post my results here.
Annabel
-
Hello, I would like to follow up on this issue. Has there been any improvement to this problem? I've mainly used GATK 4.2.0 and 4.2.2.
I have hundreds of high-depth genomes that I need to run and would like to use GenomicsDBImport, but it scales extremely poorly with my reference. Similarly, CombineGVCFs is unusably slow due to the number of samples. I've split the reference using SplitIntervals with various subdivision modes that maintain the contigs and run each interval separately in GenomicsDBImport. The interval that contains the thousands of small contigs takes a great deal longer to process than the others (albeit generally faster to genotype with GenotypeGVCFs). After each interval is imported and genotyped, I gather the VCFs back together with SortVcf (very fast). I've also tried --genomicsdb-shared-posixfs-optimizations with no significant improvement in performance.
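For reference, the splitting step looks roughly like this (scatter count and paths are illustrative; the chosen mode keeps whole contigs together rather than subdividing them):

gatk SplitIntervals -R reference.fasta \
    --scatter-count 50 \
    --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
    -O scattered_intervals/

Each resulting interval list then gets its own GenomicsDBImport/GenotypeGVCFs job before the gather with SortVcf.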
Thanks,
Kyle
-
Hi Stiers, Kyle,
Please see this document on the best practices and suggestions for speeding up GenomicsDBImport and let me know if it is helpful. There are still users experiencing long run times with GenomicsDBImport and GenotypeGVCFs, but there have been improvements, and there are several more recent forum posts on this topic that may contain useful suggestions (e.g. https://gatk.broadinstitute.org/hc/en-us/community/posts/360074780432-Speeding-up-GenotypeGVCFs-from-genomicsdb).
Kind regards,
Pamela
-
Hi Pamela,
Thanks for your prompt reply! I have read that page, both links that Genevieve shared, and pretty much every community forum post I can find on the interplay of GenomicsDBImport and GenotypeGVCFs, trying to optimize things. Unfortunately, I have not seen much in the way of speedup.
I am using Nextflow, so both steps are already being run in parallel per interval (not always simply one chromosome, due to the number of trailing contigs). Given the nature of the study we're doing, I'd really like to avoid --max-genotype-count and --max-alternate-alleles if possible. Also, those options seem to reduce memory, but not necessarily CPU time.
In the best practices document there is mention of --merge-contigs-into-num-partitions, which is the only thing I see that I haven't tried. However, it doesn't seem very feasible because I have thousands of contigs and it looks like they have to be run individually. I've also tried --merge-input-intervals, but it was problematic... it might be worth revisiting and trying again.
-
Hi Stiers, Kyle,
Thank you for trying all of the suggested solutions. Given the large scale of your study, there may not be an ideal way to get around the long run times you are experiencing. However, --merge-contigs-into-num-partitions may be a valid option because this is useful when you have really high numbers of contigs. You can find more information about the option in the GenomicsDBImport documentation to determine if it might be useful for you.
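As an illustrative sketch only (the sample map, interval list, partition count, and workspace path are placeholders, and the intervals are assumed to each cover a whole contig), such a run might look like:

gatk GenomicsDBImport --java-options "-Xmx50g" \
    --sample-name-map sample_map.txt \
    -L small_contigs.interval_list \
    --merge-contigs-into-num-partitions 25 \
    --genomicsdb-shared-posixfs-optimizations true \
    --genomicsdb-workspace-path small_contigs_db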
Kind regards,
Pamela