Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GATK GenotypeGVCFs stuck on starting traversal

Answered
9 comments

  • Mary Happ

    Small update: I have found that not maxing out the Java -Xms and -Xmx options relative to the memory I request makes GenomicsDBImport run much faster. For now I am creating a new database with the --consolidate and --genomicsdb-shared-posixfs-optimizations options, which should finish in 1-2 days, and then I can retry GenotypeGVCFs.
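
A minimal dry-run sketch of the import described above. The helper name `build_import_cmd`, the heap sizes, and all paths (`sample_map.txt`, the workspace directory, the interval) are hypothetical placeholders, not values from this thread. The function only prints the command so it can be inspected before submitting a job. The heap is kept well below whatever memory the job requests, on the understanding that GenomicsDB's native library allocates memory outside the Java heap.

```shell
#!/bin/sh
# Dry-run sketch: print a GenomicsDBImport command instead of running it.
# Paths, heap sizes, and the helper name are illustrative assumptions.

build_import_cmd() {
  workspace=$1      # one workspace directory per interval
  interval=$2       # e.g. chr01:1-5000000
  consolidate=$3    # "yes" adds --consolidate, anything else omits it

  cmd="gatk --java-options \"-Xms4g -Xmx16g\" GenomicsDBImport"
  cmd="$cmd --genomicsdb-workspace-path $workspace"
  cmd="$cmd --sample-name-map sample_map.txt"
  cmd="$cmd -L $interval"
  cmd="$cmd --genomicsdb-shared-posixfs-optimizations"
  if [ "$consolidate" = "yes" ]; then
    cmd="$cmd --consolidate"
  fi
  printf '%s\n' "$cmd"
}

build_import_cmd db/chr01/int1 chr01:1-5000000 yes
```

Passing anything other than `yes` as the third argument prints the same command without `--consolidate`, which makes it easy to compare the two variants side by side.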

  • Genevieve Brandt (she/her)

    Thanks for the update, Mary Happ! Your thorough post is very helpful for troubleshooting. It seems that you have read the other posts carefully and are following all of our recommendations. Since you have found a way to run GenomicsDBImport faster, that's probably the best step for now. If you can send another update by Monday midmorning, I can take any remaining questions to my colleagues who are experts in optimizing GenomicsDB workspace usage.

  • Mary Happ

    Hi Genevieve - A couple updates since Wednesday. 

    1) Running GenomicsDBImport with the --consolidate flag had some problems for me. After all batches are imported, the program stalls and eventually fails with the error:

    [TileDB::utils] Error: (gzip_handle_error) Cannot decompress with GZIP: inflate error: Z_BUF_ERROR
    [TileDB::Codec] Error: Could not compress with .
    [TileDB::ReadState] Error: Cannot decompress tile.
    VariantStorageManagerException exception : Error while consolidating TileDB array.

    2) I reran my scripts without the --consolidate flag (still retaining the --genomicsdb-shared-posixfs-optimizations flag), and this afternoon some of the intervals finished. I tried starting GenotypeGVCFs on those intervals, and after a couple of hours I am still stuck at "Starting traversal".

    15:27:40.133 INFO  GenotypeGVCFs - Done initializing engine
    15:27:40.605 INFO  ProgressMeter - Starting traversal
    15:27:40.610 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
  • Mary Happ

    Hi Genevieve - over the weekend I played with the memory I was requesting for the GenotypeGVCFs jobs. After increasing the requested memory to 150 GB and requesting about 120 of it for Java with 'gatk --java-options "-Xms10g -Xmx110g" GenotypeGVCFs', I got the other intervals to start after ~3 hours. They then run fairly quickly, taking about another 4.5 hours to complete. The slow start still only happens on intervals past the first one, which starts traversal quickly, but at least I should be able to complete the analysis for the time being.
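
A rough sketch of the sizing used in the command above: derive -Xmx from the scheduler memory request and leave the rest as headroom outside the Java heap. The helper names (`heap_gb`, `build_genotype_cmd`), `reference.fasta`, the workspace path, and the interval are hypothetical. The 40 GB headroom simply reproduces the 150 GB request and -Xmx110g from the post; the thread does not say how much headroom is actually needed.

```shell
#!/bin/sh
# Sketch: compute -Xmx from the requested memory minus a headroom reserve,
# then print (not run) the GenotypeGVCFs command. Names are hypothetical.

heap_gb() {
  total_gb=$1     # memory requested from the scheduler, in GB
  reserve_gb=$2   # headroom left outside the Java heap, in GB
  echo $(( total_gb - reserve_gb ))
}

build_genotype_cmd() {
  xmx=$(heap_gb "$1" "$2")
  workspace=$3
  interval=$4
  printf 'gatk --java-options "-Xms10g -Xmx%sg" GenotypeGVCFs -R reference.fasta -V gendb://%s -L %s -O out.vcf.gz\n' \
    "$xmx" "$workspace" "$interval"
}

# 150 GB requested, 40 GB left as headroom (placeholder workspace and interval)
build_genotype_cmd 150 40 db/chr01/int2 chr01:5000001-10000000
```

Keeping the reserve as an explicit argument makes it easy to try different splits between heap and native memory without editing the command itself.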

  • Genevieve Brandt (she/her)

    Hi Mary,

    I spoke with our GenomicsDB team about your issue and have a few recommendations if you still want to try them out.

    1. First, we recommend that you update to our newest GATK version, 4.2.3.0. It contains a new GenomicsDB version, and we think it could help with the problems you are seeing in GenotypeGVCFs. You wouldn't need to re-create your workspaces; you can just run GenotypeGVCFs with the new version. We also think that this new version will let the --consolidate flag succeed without the TileDB error.
    2. Are your runs all contained in different GenomicsDB workspaces? If you are running multiple GenotypeGVCFs jobs on the same workspace, there could be issues.
    3. Do you have many small chromosomes? This could lead to a slowdown at startup. We think that the new GATK version could help with this.

    Let me know how it goes and if you have any further questions.

    Best,

    Genevieve

  • Mary Happ

    Ok, I've updated to the newest version and I will let you know how things go there. As for your other two points: yes, all my runs are contained in different workspaces (separate folders for each interval, nested within a folder for each chromosome). My genome has 20 chromosomes with sizes ranging from 38 to 58 million bp, so I don't think they are particularly small, but I guess that might be contextual!
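
The layout described above (one workspace folder per interval, nested under a folder per chromosome, and one GenotypeGVCFs job per workspace) could be sketched roughly like this. The `db/` root, the folder naming scheme, `reference.fasta`, and the two example intervals are all hypothetical; the loop only prints the commands.

```shell
#!/bin/sh
# Sketch: map each interval to its own GenomicsDB workspace and print one
# GenotypeGVCFs command per workspace. Directory names are hypothetical.

workspace_for() {
  chrom=$1
  start=$2
  end=$3
  printf 'db/%s/%s_%s_%s\n' "$chrom" "$chrom" "$start" "$end"
}

# Placeholder intervals, one "chrom start end" triple per line
printf '%s\n' 'chr01 1 5000000' 'chr01 5000001 10000000' |
while read -r chrom start end; do
  ws=$(workspace_for "$chrom" "$start" "$end")
  echo "gatk GenotypeGVCFs -R reference.fasta -V gendb://${ws} -L ${chrom}:${start}-${end} -O ${chrom}_${start}_${end}.vcf.gz"
done
```

Because every interval resolves to a distinct path, no two printed jobs read from the same workspace.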

  • Genevieve Brandt (she/her)

    Ok, let me know if you see any improvement! I don't think that your chromosomes would be causing the issue with startup. We usually see that problem when people have thousands of chromosomes (like doing exome analysis).

  • Rômulo Carleial

    Hi Genevieve. This is an old thread, but GATK is giving me a lot of trouble with the same issue described above. Two 5 million bp intervals are stuck at the traversal step. Last time it took GATK more than 6 days to initialise the SNP calling, but my HPC only allows jobs of up to 10 days, so it ran out of time. I am trying to re-run the scripts, but I am facing the same issue.

    This is very weird, as most 5 million bp intervals worked fine, bar these two and a few others that also took days to initialize but managed to finish in time. These two intervals are in different workspaces, and I am using GATK 4.2.6.1.

  • Laura Gauthier

    Hi Rômulo Carleial,

    Are you working with an organism that has high ploidy?  The two things that slow down GenotypeGVCFs the most are high ploidy and high numbers of alternate alleles.  I believe that you can most efficiently control the number of alternate alleles with `--genomicsdb-max-alternate-alleles`.  Set that to maybe 2 or 3 more than the largest number you want to see in your output VCF.  If a site has more than that number of alt alleles (including the non-ref), then it will get skipped, which should save a lot of compute composing the PLs.  I believe the site will get output in the final VCF, but the genotypes will all be no-calls.
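
As a hedged illustration of the suggestion above, the limit can be computed as the number of alt alleles wanted in the output plus a small margin. The helper name `genomicsdb_max_alts`, the example values (6 wanted, margin of 3), and all paths are assumptions for illustration; only the `--genomicsdb-max-alternate-alleles` argument comes from the reply. The script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: choose --genomicsdb-max-alternate-alleles as "wanted + margin".
# Helper name, example numbers, and paths are illustrative assumptions.

genomicsdb_max_alts() {
  wanted=$1   # largest number of alt alleles you want in the output VCF
  margin=$2   # extra allowance; the reply above suggests 2 or 3
  echo $(( wanted + margin ))
}

db_max=$(genomicsdb_max_alts 6 3)
echo "gatk GenotypeGVCFs -R reference.fasta -V gendb://db/chr01/int1 -L chr01:1-5000000 --genomicsdb-max-alternate-alleles ${db_max} -O out.vcf.gz"
```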

