Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDBImport very slow on genome with many contigs


13 comments

  • Bhanu Gandham

    Hi,

    1. How much physical memory do you have in addition to 50G?
    2. Is the data remote? Can you please share the contents of your '--sample-name-map samples.block0.map' file?
    3. You have few enough samples that you could try CombineGVCFs instead of GenomicsDBImport and see if that works.
    4. As a last resort, reduce the number of contigs to a few hundred and try GenomicsDBImport again.
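    For point 3, a CombineGVCFs invocation might look roughly like this (the reference name and output path are illustrative, not from this thread):

```
gatk CombineGVCFs \
    -R reference.fasta \
    --variant 101.g.block0.vcf.gz \
    --variant 102.g.block0.vcf.gz \
    -O combined.g.block0.vcf.gz
```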

     

  • Marley Yeong

    Hi Bhanu,

    Thanks for the reply! 

    1. I'm working on an HPC server with at least 150 GB per node, and I'm working with ~60 nodes.

    2. No, the data is local (working on an SSD).

    head samples.block0.map:

    101 101.g.block0.vcf.gz
    102 102.g.block0.vcf.gz
    103 103.g.block0.vcf.gz
    104 104.g.block0.vcf.gz
    105 105.g.block0.vcf.gz
    106 106.g.block0.vcf.gz
    107 107.g.block0.vcf.gz
    108 108.g.block0.vcf.gz
    109 109.g.block0.vcf.gz

    3. I have tried that, and it didn't seem faster (tried it on a smaller dataset).

    4. So right now I'm running GenomicsDBImport on an SSD, and I found that it was writing a lot of data to the tmp dir. The GenomicsDBImport steps were about a quarter of the way through, and the jobs had already used about 7 TB of data (tmp and output dirs). The SSD is only 20 TB, so there is no way this would have fit. Is the data produced in the tmp dir removed when the GenomicsDBImport step is finished?

    Another workaround would be to just paste the contigs together (with a 1000 N gap between them). Right now I'm testing GATK 4.1.6 to see if that makes a difference.
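    For what it's worth, the concatenation workaround described above can be sketched in plain Python. The grouping scheme, function names, and offset bookkeeping here are illustrative sketches, not anyone's actual pipeline:

```python
# Sketch: concatenate many small contigs into a few super-scaffolds,
# separated by runs of N, and record each contig's start offset so
# variant coordinates can later be mapped back to the original contigs.
GAP = "N" * 1000  # 1000 N spacer between contigs (length is a choice)

def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:].split()[0], []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def concatenate(contigs, n_groups=50):
    """Distribute contigs round-robin over n_groups super-scaffolds.

    Returns (scaffolds, offsets), where offsets maps each contig name to
    (scaffold name, 0-based start of that contig within the scaffold).
    """
    scaffolds = {f"super_{i}": [] for i in range(n_groups)}
    pos = {name: 0 for name in scaffolds}
    offsets = {}
    for i, (name, seq) in enumerate(contigs):
        scaf = f"super_{i % n_groups}"
        if scaffolds[scaf]:           # not the first contig: add N spacer
            scaffolds[scaf].append(GAP)
            pos[scaf] += len(GAP)
        offsets[name] = (scaf, pos[scaf])
        scaffolds[scaf].append(seq)
        pos[scaf] += len(seq)
    return {k: "".join(v) for k, v in scaffolds.items()}, offsets
```

    Keeping the per-contig offsets table around is what later makes it possible to translate variant positions back to the original contigs.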

  • Bhanu Gandham

    Hi Marley Yeong

     

    1. I am very interested in how the workaround (pasting the contigs together) works out for you. Please share your experience after you have tested it. I wonder if the coordinates might be messed up by stitching together contigs with 1000 N gaps.
    2. The tmp dir should be cleaned up on exit, yes.

    General Info: GenomicsDBImport uses temporary disk storage during import. So the size of the tmp dir will depend on the size of vcfs imported. The amount of temporary disk storage required can exceed the space available, especially when specifying a large number of intervals. The command line argument `--tmp-dir` can be used to specify an alternate temporary storage location with sufficient space.
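    For example, to point the import at a larger scratch volume (paths and interval list are illustrative):

```
gatk GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --sample-name-map samples.block0.map \
    -L intervals.list \
    --tmp-dir /scratch/gatk_tmp
```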

     

  • Marley Yeong

    Hi Bhanu,

    Thanks again for your reply! 

    I had the time to make a "concatenated" version of the genome, where ~400,000 contigs were pasted together into 50 groups/chromosomes.

    I tested this with a very small dataset consisting of 3 samples with about 3k reads:

    Running on 1/50 chromosomes/blocks of the concatenated genome:

    HaplotypeCaller ran in 4 minutes.

    GenomicsDBImport ran in 6 seconds.

    GenotypeGVCFs ran in 6 seconds.

    Running on 1/200 interval files (consisting of 3620 intervals):

    HaplotypeCaller ran in 1 minute.

    GenomicsDBImport ran in 10 hours.

    GenotypeGVCFs ran in 6 hours.

     

    As you can see, concatenating the contigs into 50 larger blocks/chromosomes with 1000 N gaps radically decreases the runtime. Of course, I didn't run this on a full dataset, and I imagine the difference will be somewhat smaller with full datasets due to less relative overhead.

    I will continue the analysis with the concatenated genome. Of course, there will be an extra step required to translate the variant locations back to the original contigs, but at least this provides a workable workaround.
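    For readers facing the same situation, that back-translation step can be sketched independently. The offset-table layout and names below are hypothetical, not anyone's actual in-house code; they assume you recorded, for each original contig, its super-scaffold, 0-based start offset, and length while concatenating:

```python
import bisect

def build_index(offsets):
    """offsets: {contig: (scaffold, start, length)} -> per-scaffold sorted lists."""
    index = {}
    for contig, (scaf, start, length) in offsets.items():
        index.setdefault(scaf, []).append((start, contig, length))
    for entries in index.values():
        entries.sort()
    return index

def translate(index, scaffold, pos):
    """Translate a 1-based position on `scaffold` to (contig, 1-based position).

    Returns None if the position falls inside an N spacer between contigs.
    """
    entries = index[scaffold]
    starts = [e[0] for e in entries]
    # Find the last contig whose start is <= the 0-based position.
    i = bisect.bisect_right(starts, pos - 1) - 1
    if i < 0:
        return None
    start, contig, length = entries[i]
    offset = pos - 1 - start
    if offset >= length:
        return None  # position lies in an N gap
    return contig, offset + 1
```

    Applied to a VCF, this would rewrite each record's CHROM/POS pair (and drop or flag any call landing in a spacer, which should not happen for real variants).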

    I will keep you updated on the runtime with the full dataset. If this is truly an inefficiency of GATK I hope you can improve this in future versions! Although I can imagine not a lot of people will have genomes that consist of nearly half a million contigs.

     

     

  • Bhanu Gandham

    Yes please keep me posted on the runtime with the full dataset. I will discuss your findings with the dev team and get back to you. 

  • annabeldekker


    Dear Marley and GATK dev team,

    My observations: I am experiencing similar problems using the most recent GATK version (4.1.7.0). My goal is to combine 1920 GVCFs (output of GATK HaplotypeCaller on BWA BAM files) of onion exome sequencing material (13 Gb) against ~17,000 intervals of an onion assembly genome (consisting of ~80,000 contigs). In an earlier test with only 34 GVCFs (~2 MB per file), this analysis took more or less 50 hours using 100 GB RAM with 32 threads (500 GB RAM available per node).

    Command used:

    gatk GenomicsDBImport --java-options '-Xmx100g' --genomicsdb-workspace-path mypath.my_database --sample-name-map sample_map.txt -L snps.bed --merge-input-intervals true --reader-threads 32 --create-output-variant-index false --max-num-intervals-to-import-in-parallel 20 --ip 5

    My problem: With the bigger sample set (N = 1920), Java was giving memory errors (which makes sense), even when increasing memory to 200 GB. I am performing the analysis in batches now, but I am really curious whether I am being efficient and getting the most out of the tool. I have also heard of cases from colleagues using chromosome-level reference genomes for which the analysis took only a few hours.

    My question: I am considering your workaround, Marley, so I was wondering whether you can recommend it, or whether you found other possibilities? Also, dev team, I am curious about your discussions on this topic.

    Thanks a lot in advance,

    Annabel Dekker

     

  • Marley Yeong

    Hi Annabel,

    Believe it or not, the run on the full dataset still hasn't completed. This time it's not the fault of GenomicsDBImport, though: using GATK 4.1.6 we ran into a bug, so the whole pipeline had to be run again.

    But I would definitely recommend my approach for GenomicsDBImport. Just concatenate all the contigs into super-scaffolds (with some gap between the contigs) roughly the size of a normal chromosome, and you should be fine.

    You could "deconvolute" the variants back to the original contigs after genotyping, at least that's how we do it. Unfortunately I am not allowed to share that code with anyone outside the company.

    Cheers,

    Marley

     

  • annabeldekker

    Hi Marley,

    Thanks for your quick reply! How long do you expect your GenomicsDBImport run will take now that the other bugs are solved? I'll seriously consider taking your advice if the GATK developers haven't found other solutions yet. Let's keep each other in the loop, as I expect to run into this more often in the future.

    Good luck,

    Annabel

  • Bhanu Gandham

    Thank you for your response and contribution to building the GATK knowledge base in this forum, Marley Yeong.

    annabeldekker, we haven't found a solution for this yet. When you do try this workaround, please post your findings/thoughts here so other users can benefit from it too.

  • Marley Yeong

    Hi Annabel, 

    GenomicsDBImport seems to run about as fast as GenotypeGVCFs. I can't really give you a time indication yet, but from different projects it at least seems to be a lot faster than HaplotypeCaller!

    Cheers!

    Marley

  • Marley Yeong

    Hi all,

    The analysis completed, and GenomicsDBImport ran as fast as I was expecting. I concatenated the 7.5 GB genome with ~0.5 million contigs into 150 contigs, and within 2 days all jobs were finished (running GenomicsDBImport per contig).

    Cheers,

    Marley

  • Bhanu Gandham

    Thanks for sharing Marley!

  • annabeldekker

    Thanks Marley,

    I eventually decided to scale down the complexity by selecting only certain regions. On top of that, I used your workaround of merging contigs with 5,000 Ns. It sped up the whole pipeline to 7 hours (still left with 1,200 contigs, but a new reference of only 33 Mb instead of 13 Gb). So thank you very much for your ideas! If I decide to test the merged-contig workaround on the original genome, I will post my results here.

    Annabel

