Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Speeding up GenotypeGVCFs? GATK4


  • David Benjamin

    Cecilia Kardum Hjort GenotypeGVCFs can run slowly because the GenomicsDB has to be loaded into memory.  If it takes up too much RAM, things can run very slowly.  You will probably get things to run faster just by processing the chromosomes one at a time.  To do this, use the -L <chromosome> argument for both GenomicsDBImport and GenotypeGVCFs.  That is (in bash):

    for n in {1..19}; do
      gatk GenomicsDBImport -L $n <rest of command same as before>
      gatk GenotypeGVCFs -L $n <rest of command same as before>
    done

    Of course, you could also parallelize by farming out the per-chromosome jobs to a cluster or to Terra (https://app.terra.bio/#workspaces/help-gatk/GATK4-Germline-Preprocessing-VariantCalling-JointCalling/workflows/broad-firecloud-dsde/1-4-Joint-Genotyping-HG38).
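For anyone adapting the loop above, here is a minimal dry-run sketch that prints the per-chromosome command pairs instead of running them, so they can be checked first. The file names (sample_map.txt, ref.fasta), the db_chr*/out_chr* naming, and the helper print_chrom_commands are assumptions for illustration only; the thread does not show the original commands, so substitute your own arguments for everything after -L.

```shell
#!/usr/bin/env bash
# Dry run: print one GenomicsDBImport + GenotypeGVCFs pair per chromosome.
# ASSUMPTIONS (not from the thread): sample_map.txt, ref.fasta, the
# db_chr<N> workspace names, the out_chr<N>.vcf.gz outputs, and
# chromosomes being named 1..19 as in the loop above.

print_chrom_commands() {
  local n=$1
  # One workspace per chromosome, so each database stays small.
  echo "gatk GenomicsDBImport -L ${n} --sample-name-map sample_map.txt --genomicsdb-workspace-path db_chr${n}"
  # Genotype from that chromosome's workspace only.
  echo "gatk GenotypeGVCFs -L ${n} -R ref.fasta -V gendb://db_chr${n} -O out_chr${n}.vcf.gz"
}

for n in {1..19}; do
  print_chrom_commands "$n"
done
```

Once the printed commands look right, they can be piped to bash to run one after another, or each pair can be submitted as a separate cluster job.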

  • Cecilia Kardum Hjort

    David Benjamin Thanks for your reply! Could I run only GenotypeGVCFs with the -L option, since the GenomicsDB has already been created?

  • David Benjamin

    Unfortunately, I think the answer is no.  The problem is that `-L` reduces CPU cost by running over only the selected intervals, but it doesn't address the memory issue of loading the whole `GenomicsDB`.  (Someone who is more of an expert may chime in and tell me I'm wrong and that `-L` actually results in only a subset of the `GenomicsDB` being put into RAM, but I don't believe the DB is that smart about `-L`).

    By the way, the reason memory issues can cause runtime problems is that the system starts to spend a lot of time just finding available spots in memory as RAM becomes saturated.  In extreme cases the computer spends inordinate amounts of time moving data back and forth from RAM to hard disk trying to use the disk as virtual memory.  This is called thrashing.

    If breaking up both tasks into chromosome chunks doesn't help, let us know.  It's possible that the bumblebee genome has an edge case we haven't encountered before.
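One rough way to check whether a run is heading toward the thrashing described above is to watch how much swap is in use while the job runs. The sketch below is Linux-only and assumes the standard /proc/meminfo fields SwapTotal and SwapFree; the helper name swap_used_kb is made up for illustration and is not part of GATK.

```shell
#!/usr/bin/env bash
# Report swap in use, read from a /proc/meminfo-style file (Linux only).
# ASSUMPTION: swap_used_kb is a hypothetical helper, not a GATK feature.

swap_used_kb() {
  # Optional first argument: path to a meminfo-format file.
  awk '/^SwapTotal:/ {swap_total = $2}
       /^SwapFree:/  {swap_free = $2}
       END {print swap_total - swap_free}' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  echo "swap in use: $(swap_used_kb) kB"
fi
```

If this number keeps climbing while GenotypeGVCFs is running, the job is probably spilling out of RAM, which is the situation the per-chromosome split is meant to avoid.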

