Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Speeding up GenotypeGVCFs? GATK4

Answered

17 comments

  • David Benjamin

    Cecilia Kardum Hjort GenotypeGVCFs can run slowly because the GenomicsDB has to be loaded into memory, and if it takes up too much RAM everything bogs down. You will probably get things to run faster just by serializing over different chromosomes. To do this, pass the -L <chromosome> argument to both GenomicsDBImport and GenotypeGVCFs. That is (in bash):

    for n in {1..19}; do
      gatk GenomicsDBImport -L $n <rest of command same as before>
      gatk GenotypeGVCFs -L $n <rest of command same as before>
    done

    Of course, you could also parallelize by farming out the different jobs on a cluster, or on Terra (https://app.terra.bio/#workspaces/help-gatk/GATK4-Germline-Preprocessing-VariantCalling-JointCalling/workflows/broad-firecloud-dsde/1-4-Joint-Genotyping-HG38)
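
    For example, here is a minimal sketch of farming one job per chromosome out to a cluster, assuming a SLURM scheduler (the job name and --mem request are illustrative placeholders; adapt to your own scheduler):

    for n in {1..19}; do
      # submit one independent job per chromosome; each job loads only that chromosome's GenomicsDB
      sbatch --job-name="joint_chr${n}" --mem=32G --wrap \
        "gatk GenomicsDBImport -L ${n} <rest of command same as before> && gatk GenotypeGVCFs -L ${n} <rest of command same as before>"
    done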

  • Cecilia Kardum Hjort

    David Benjamin Thanks for your reply! Could I run only GenotypeGVCFs with the -L option, since the GenomicsDB has already been created?

  • David Benjamin

    Unfortunately, I think the answer is no.  The problem is that `-L` reduces CPU cost by running over only the selected intervals, but it doesn't address the memory issue of loading the whole `GenomicsDB`.  (Someone who is more of an expert may chime in and tell me I'm wrong and that `-L` actually results in only a subset of the `GenomicsDB` being put into RAM, but I don't believe the DB is that smart about `-L`).

    By the way, the reason memory issues can cause runtime problems is that the system starts to spend a lot of time just finding available spots in memory as RAM becomes saturated.  In extreme cases the computer spends inordinate amounts of time moving data back and forth from RAM to hard disk trying to use the disk as virtual memory.  This is called thrashing.

    If breaking up both tasks into chromosome chunks doesn't help, let us know.  It's possible that the bumblebee genome has an edge case we haven't encountered before.

  • jfarrell

    Based on some runs with GATK 4.2.0.0, it does seem to use a lot of memory no matter the size of the interval, and the usage correlates with the size of the chromosome. When the GenomicsDBs are split by chromosome (chr 1-22), the memory used increases as the chromosome size increases. Running GenotypeGVCFs on an interval of less than 1 Mb still uses 60-120 GB, depending on the chromosome, for 4000 samples. Instead of being able to run 16 jobs in parallel on a 196 GB compute node, our pipeline can only run 2 or 3 because of the memory requirements. Why can't GenotypeGVCFs minimize memory requirements by reading in only what is needed for the joint genotyping of the region?

  • Pamela Bretscher

    Hi jfarrell,

    There are some previous forum posts that discuss ways to work around the memory allocation that is necessary for reading a GenomicsDB into GenotypeGVCFs, such as this one: https://gatk.broadinstitute.org/hc/en-us/community/posts/360074780432-Speeding-up-GenotypeGVCFs-from-genomicsdb

    Additionally, this document contains some helpful tips for speeding up GenomicsDB which also apply to running GenotypeGVCFs from a GenomicsDB workspace: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines

    I hope you find this information helpful.

    Kind regards,

    Pamela

  • jfarrell

    These are useful suggestions on speeding up GenotypeGVCFs, which I am using. None of them address the large amount of memory used no matter how small the target size that is specified. GenotypeGVCFs spends a lot of time initializing, doing I/O and using a lot of memory, before it even starts calling variants. The large amount of memory required limits the ability to parallelize the jobs on HPC clusters. What is going on during the GenotypeGVCFs initialization that requires so much I/O and memory?

  • Pamela Bretscher

    Hi jfarrell,

    The size of the intervals specified does not matter as much as the number of intervals you specify when determining how much memory is used. The total amount of data that you are inputting to GenotypeGVCFs is likely causing the large amount of memory needed to initialize the tool. Are you still able to get it to run successfully despite the memory requirements?

    Kind regards,

    Pamela

  • jfarrell

    It runs to completion if I give the job a lot of memory (80-120 GB) and target a small interval for the GenomicsDB, with 2000 samples for each chromosome.

    The output vcf.gz is only 55 MB, and that is for chr3-5882072-5975713.vcf.gz.

    Unzipped it is only ~350 MB:
    -rw-r--r-- 1 farrell kageproj 336M Jul 22 16:42 unzipped.vcf

    The memory used is way out of proportion to the actual size of the VCF being generated. Why is so much memory being used? GenotypeGVCFs is extremely memory inefficient, and this limits the number of parallel jobs that can be run.

    Are there any benchmarks describing the memory/CPU requirements for various cohort sizes to run DRAGEN/GATK?

     

  • Pamela Bretscher

    Hi jfarrell,

    There are several factors that contribute to the memory that needs to be used other than the size of the output file. For example, it takes a greater amount of memory to genotype sites with many alternate alleles or genotype counts. The memory allocation, therefore, has more to do with the data the tool needs to read and analyze rather than the information it outputs. There is information about some optional arguments that might help reduce memory usage in the GenotypeGVCFs tool documentation (--max-genotype-count, --max-alternate-alleles, --disable-bam-index-caching).
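
    As a rough illustration (not an official recommendation), those arguments could be combined with an existing GenotypeGVCFs command along the following lines; the memory setting, paths, interval and argument values are placeholders to adapt:

    # -Xmx, file paths, the interval and the argument values are illustrative only
    gatk --java-options "-Xmx60g" GenotypeGVCFs \
      -R reference.fasta \
      -V gendb://my_genomicsdb_workspace \
      -L chr3:5882072-5975713 \
      --max-alternate-alleles 3 \
      --max-genotype-count 500 \
      --disable-bam-index-caching true \
      -O output.vcf.gz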

    I'm not currently aware of a document that lists the memory requirements for different cohort sizes; I would imagine that this depends on many other factors, including the machine you are running it on. I'm sorry that I don't have more useful information for you, but I'm glad that you were still able to get the job to run to completion.

    Kind regards,

    Pamela

  • kvn95ss

    Sorry for bumping an old thread, but I am facing similar problems with the memory requirements and inability to scatter across multiple smaller intervals due to server hardware limitations.

    I was going to recommend some method of caching the input GVCF files in memory, akin to the STAR aligner's LoadAndExit option, where a copy of the reference genome is cached in memory to make subsequent alignments quicker. However, since there are other calculations that are taking up the memory, I'm not sure how useful this approach would be.

  • Genevieve Brandt (she/her)

    Thanks for the post, kvn95ss. We don't think this caching would improve the memory problems, because at this point we don't even load the whole GVCFs, just the bits we are working with. We already do something like this with the reference.

    More caching probably wouldn't help memory anyway; it might make the memory problems worse, though it may help performance. Are your main issues with performance or memory?

  • jfarrell

    This is a potential solution that I will be testing out with my next batch of CRAMs. There is now support in the 4.2.6.1 CombineGVCFs and GenotypeGVCFs for "reblocked" GVCFs as produced by the ReblockGVCF tool. Reblocked GVCFs have a significantly reduced storage footprint. I expect the reduced size will help with both the speed and memory requirements for the joint genotyping. For one sample's chr1 GVCF, the g.vcf.gz dropped from 827 MB to 134 MB in the reblocked g.vcf.gz. Hopefully that smaller file size will translate into less memory, I/O and compute time for the GenotypeGVCFs step.
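
    A minimal sketch of the reblocking step, run once per sample GVCF before re-importing into GenomicsDB (the file names here are placeholders):

    # write a reblocked GVCF with a much smaller storage footprint
    gatk ReblockGVCF \
      -R reference.fasta \
      -V sample.chr1.g.vcf.gz \
      -O sample.chr1.reblocked.g.vcf.gz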

  • Genevieve Brandt (she/her)

    Good point jfarrell! Yes, this can really help.

  • kvn95ss

    jfarrell and Genevieve Brandt (she/her) using normal GVCFs, creating the GenomicsDB with GenomicsDBImport and running GenotypeGVCFs used to take ~24 hours (around 1.9k samples, split across each chromosome, with chr1 and chr2 further split in half).

    With the same setup, but using reblocked GVCFs and --bypass-feature-reader, the time went down to ~12 hours! That's a 50% reduction in time!

     

    But as Kylo Ren said in Star Wars: The Last Jedi - "More... MORE!!"

  • Genevieve Brandt (she/her)

    kvn95ss you may see some extra speed-up if you break up all your intervals one more time. So chromosomes 1 and 2 would each be split into four pieces, and all the other chromosomes would be split in half. We typically run whole genomes 50-way parallel.
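
    As a rough sketch of what that extra split could look like for one chromosome (the coordinates assume GRCh38 chr1 and are otherwise illustrative; each region can also be submitted as its own cluster job):

    # chromosome 1 split into four pieces instead of two
    for region in chr1:1-62000000 chr1:62000001-124000000 \
                  chr1:124000001-186000000 chr1:186000001-248956422; do
      gatk GenotypeGVCFs -L ${region} <rest of command same as before>
    done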

  • jfarrell

    kvn95ss and Genevieve Brandt (she/her)

    Regarding the --bypass-feature-reader recommendation, the documentation mentions that the VCFs must be normalized. Are reblocked GVCFs normalized and ready to use, or is there another step involved?

    --bypass-feature-reader (default: false) - Use htslib to read input VCFs instead of GATK's FeatureReader. This will reduce memory usage and potentially speed up the import. Lower memory requirements may also enable parallelism through max-num-intervals-to-import-in-parallel. To enable this option, VCFs must be normalized, block-compressed and indexed.

  • Genevieve Brandt (she/her)

    jfarrell the option --bypass-feature-reader needs the VCF to have no duplicate records, meaning the same site does not appear on multiple lines of the VCF. The default GATK pipeline does not produce duplicate records, so if you are creating your VCF with GATK you should be fine. If you merge VCFs naively or use a different variant calling software, you might get duplicate records, which do not work with --bypass-feature-reader.
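
    For reference, a minimal sketch of passing the flag to GenomicsDBImport (the workspace path, sample map, interval and batch size are placeholders):

    # input GVCFs must be block-compressed, indexed and free of duplicate records
    gatk GenomicsDBImport \
      --genomicsdb-workspace-path chr3_workspace \
      --sample-name-map cohort.sample_map \
      -L chr3 \
      --batch-size 50 \
      --bypass-feature-reader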

    Thanks for writing in to clarify!

