Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Strategy for building GenomicsDB

17 comments

  • Bhanu Gandham

Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

We cannot guarantee a reply; however, we encourage other community members to help out if they know the answer.

    For context, check out our support policy.

  • Tiffany Miller

Hi Bernt Guldbrandtsen, this older blog post discusses multi-interval support for GenomicsDB; maybe it will lend some context: https://github.com/broadinstitute/gatk-docs/blob/master/blog-2012-to-2019/2019-02-22-Improvements_in_Germline_Short_Variants_Calling_Part_1.md

  • Bernt Guldbrandtsen

    Hi Tiffany,

    Thank you for the link. I looked through it, and it says GATK should be able to run in parallel. However, GATK-4.1.4.1 upon initialization says:

    "20:33:38.220 WARN GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization."

In addition, even after running for 5 days on a fast server with plenty of memory and fast file-server access, it hadn't even finished importing the g.vcf.gz file for the first individual. Clearly, what I'm doing isn't working, but I'm clueless as to what I'm doing wrong.

    The complete command line is:

    gatk --java-options "-Xmx100g -Xms100g" GenomicsDBImport --genomicsdb-workspace-path Cattle --sample-name-map cattle.sample_map -L chr.list --tmp-dir=/usr/home/qgg/bernt/tmp --reader-threads=25

The sample map contains IDs and paths for 388 animals, and the list of chromosomes contains all 29 autosomes, the two sex chromosomes, and the MT.

    Any suggestions?

  • Tiffany Miller

    Hi Bernt!

This is a warning message, not an error, and it means you can't use the --reader-threads argument here: multiple reader threads are only supported when a single interval is specified. In terms of optimizing your jobs, here are a few ideas:

1. Make sure you aren't giving all of the physical memory to the JVM via -Xmx. Is there more than 100 GB available to you? With -Xmx100g set, you probably want around 130 GB of total physical memory available.

2. You can run all samples at once, but run them in batches per interval. Run one interval at a time, then bring everything together in a final GenotypeGVCFs run.

    3. Either way, try the --batch-size argument to limit memory use. You could try starting with --batch-size 50.

    Let us know if any of this helps!
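A minimal sketch of idea 2, assuming one job per chromosome (the chromosome names, memory setting, and sample-map path are all placeholders; the commands are only printed so you can hand them to your own scheduler):

```shell
# Build one GenomicsDBImport command per interval: every sample is
# imported, in batches of 50, with one interval per job.
# All values are placeholders; replace them with your own.
import_cmd() { # arg: interval (here, a chromosome name)
  echo "gatk --java-options -Xmx100g GenomicsDBImport" \
       "--genomicsdb-workspace-path Cattle_$1" \
       "--sample-name-map cattle.sample_map" \
       "--batch-size 50 -L $1"
}
for chr in chr25 chr26 chr27; do
  import_cmd "$chr"   # printed here; pipe to qsub/sbatch/etc. in real use
done
```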

  • Bernt Guldbrandtsen

    Hi Tiffany,

    Thank you for your reply.

    1. To check, I increased Xms to 200g. 

    2. I don't understand the latter part of the comment. I'm running a GenomicsDBImport. Given 29 autosomes, the X, the MT plus a couple of thousand shorter unassembled scaffolds, that will produce a couple of thousand GenomicsDB databases. What does "assemble everything" then mean? Should I somehow later concatenate/combine/assemble the GenomicsDB databases?

3. I have added the --batch-size argument. With only one interval (BTA1) in the job, the warning about multi-threading being disabled has gone away. However, according to top, the process still only uses about 1/3 of one core on average (24 virtual cores allocated). The content of the GenomicsDB database grows by about 1 MB per second.
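For reference, a rough way to sample that growth rate (not necessarily exact; the workspace name and sampling interval below are arbitrary):

```shell
# Sample a directory's on-disk growth over a short interval and report
# KB/s. For a real run, point it at the GenomicsDB workspace directory.
growth_kb_s() { # args: directory, seconds to wait
  s1=$(du -sk "$1" | cut -f1)
  sleep "$2"
  s2=$(du -sk "$1" | cut -f1)
  echo $(( (s2 - s1) / $2 ))
}
# e.g.: growth_kb_s Cattle_chr25 10
```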

    It seems there still is something very basic that I don't get about how to approach this. 

    Best regards,

    Bernt

  • Bernt Guldbrandtsen

    Hello again,

now the first batch of 50 samples has finished for the shortest autosome (BTA25). It took 36 hours. That means importing all 388 samples for the shortest autosome is going to take about two weeks. Processor utilization is less than 20 %.

    The longest autosome (BTA1) is almost four times as long and would therefore take on the order of two months to import.

    What may be slowing down the process?

Because the jobs started with multiple reader threads did not actually use more than one core, I started the jobs like this:

    $gatk --java-options "$jopt" \
    GenomicsDBImport \
    --genomicsdb-workspace-path Cattle_$chr \
    --sample-name-map cattle.sample_map \
    --batch-size 50 \
    -L $chr \
    --tmp-dir=$TMPDIR \
    --reader-threads=1

    Best regards,

    Bernt

  • sahuno

Hi Bernt Guldbrandtsen, what were your java options?

    $gatk --java-options "$jopt" \
    GenomicsDBImport \ ...

    I'm following this post. 

  • Bernt Guldbrandtsen

    Hi Samuel,

I have tried various sets of values. The amounts of memory I have tried did not seem to make a difference for speed.

    The current batch of jobs runs with 

    jopt="-Xmx10g -Xms5g"

Currently, they run on segments of 5 Mb. For cattle, that means about 550 jobs (and instances of GenomicsDB), leaving out the small scaffolds. After 17 hours, the slowest job is working on batch 3 and the fastest on batch 5, corresponding to roughly 2 to 5 hours per batch of 50 samples.
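The 5 Mb segments themselves are generated with simple shell arithmetic; a sketch (the chromosome length below is made up for illustration; the real lengths come from the reference's .fai index):

```shell
# Emit 1-based GATK-style intervals (chr:start-end) in 5 Mb windows.
window=5000000
windows() { # args: chromosome name, chromosome length in bp
  chr=$1 len=$2 start=1
  while [ "$start" -le "$len" ]; do
    end=$((start + window - 1))
    [ "$end" -gt "$len" ] && end=$len   # clip the last window
    echo "${chr}:${start}-${end}"
    start=$((end + 1))
  done
}
windows 25 12000000   # made-up length, for illustration only
```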

  • Tiffany Miller

    Hi Bernt Guldbrandtsen,

    You could break up your intervals even more and run them in parallel to see if that reduces the time.  

Here is a great thread where users discuss how they've broken down the work, and there is some interesting discussion during Geraldine's presentation on the topic here.

  • Bernt Guldbrandtsen

    Hi Tiffany Miller,

    Thank you for your reply. The other threads do not help solve this.

Parallelization does not seem to do the trick. According to top, the java process does indeed use multiple cores in the first seconds. However, after about 10-20 seconds, CPU utilization drops to <10 % most of the time, with occasional peaks of, say, 20 %.

This pattern does not change with the amount of memory allocated (tried up to 200 GB) or the number of cores (range 1 to 24). This was tested on a 24-virtual/12-physical-core top-of-the-line Intel CPU with 384 GB of RAM, a 10-gigabit connection to the file server, and an internal SSD for temp files. The machine was doing nothing else at the time.

    Now, after about two weeks with around 100 processes running at any time, I'm well short of halfway through the import process (388 samples with approx. 10X sequencing data). 

    This makes GenomicsDBImport by far the most demanding step in the WGS pipeline (not having tried calling variants based on data from GenomicsDB).

It looks like the import somehow gets bottlenecked, but the bottleneck is not RAM (usage far below what is allocated), CPU (the process utilizes <<1 % of allocated resources), tmp-space I/O (SSDs), or file-server I/O. I had our sysadmin check, and the file server is essentially twiddling its thumbs.

    Best regards,

    Bernt

  • Tiffany Miller

Thanks for describing this in more detail, Bernt Guldbrandtsen. I am following up with the team on this and will get back to you.

  • Bernt Guldbrandtsen

    Hi Tiffany,

    Thank you!

Our sysadmin moved the relevant files to a different file server (ZFS, SSD, 10-gigabit network), which usually never breaks a sweat under heavy load.

    I tried a job allocating 23 (virtual) cores (12 physical available), and a lot of RAM.

    For one job, running alone on the server, the job created a load of about 50 % of one core on average. Only early on in the process does the job ever exceed a load of 100 % of a core.

    In the first 10 or 20 seconds, the job does in fact create a heavier load, but then falls to a low sustained level.

    Best regards,

    Bernt

  • Tiffany Miller

    Hi Bernt Guldbrandtsen 

I confirmed that what you are experiencing is probably related to bug #6487.

    A few follow up questions:

    1. Can you try using the TILEDB_DISABLE_FILE_LOCKING environment variable:

    % export TILEDB_DISABLE_FILE_LOCKING=1
    % gatk GenomicsDBImport ...

    2. What is the total size of your vcfs?

    3. Can you try importing a small segment to SSD and compare that with the import to the shared filesystem? This will also give us some insight into how the shared filesystem is performing compared to the SSD.

    4. The team is working on ways to optimize reading/writing from shared file systems like NFS and Lustre. We can push those changes into a gatk branch if you'd be willing to try that branch?

    Many thanks,
    Tiff

  • Bernt Guldbrandtsen

    Dear Tiff,

    Thank you for your reply, and sorry for the very slow response. As before (for consistency), I'm testing with version gatk-4.1.6.0. For reference, the command line is listed at the bottom.

    1. Setting

    export TILEDB_DISABLE_FILE_LOCKING=1

    does not seem to change anything. Running with 23 threads with lots of RAM, CPU load is between 0.2 and 0.5 CPUs, i.e., CPU use efficiency is about 1 to 2 %.

2. At the time, the total volume of g.vcfs was just over 2 TB. Currently, the total volume is about twice that. CPU usage is about 1 CPU (23 allocated).

    3. Importing to an internal SSD happens at a speed of about 2.3 MB/s, which is better than on the file server.

4. Sure; however, I would need to ask you to compile it for me, as in the past I have not succeeded in compiling GATK by myself.

    Best wishes,

    Bernt Guldbrandtsen

    ---

    gatk=./gatk/gatk-4.1.6.0/gatk
    jopt="-Xmx100g -Djava.io.tmpdir="$TMPDIR
    walker=GenomicsDBImport

    db="$chr/Cattle_${chr}_${start}_${end}";

    $gatk --java-options "$jopt" \
    $walker \
    --genomicsdb-workspace-path $db \
    --sample-name-map cattle.sample_map2 \
    --batch-size 50 \
    -L $chr:$start-$end \
    --tmp-dir=$TMPDIR \
    --reader-threads=23 > $chr/import_${chr}_${start}_${end}.log 2>&1

  • Tiffany Miller

    Hi Bernt,

    No worries and thanks for your responses. The latest comment in the bug thread from nalinigans says:

    "4.1.8.0 has a new option - --genomicsdb-shared-posixfs-optimizations for GenomicsDBImport that disable file locking and minimize writes to NFS. We are interested to know if this option helps your use case even as we continue making performance improvements." 

    Would you be able to try this out and comment there about the results so that Nalini can see?

  • Bernt Guldbrandtsen

Hi Tiff,

    Thank you again for your reply.

    With version 4.1.8.1 and the new option added, the import proceeds at about 1.5 MB per second running against the NFS file server.

    A thing that puzzles me is that the process only ever seems to run single-threaded despite allowing more reader threads (currently 12). Only very briefly in the beginning of the run does java produce a CPU load of more than 1 CPU (up to about 2). Is multi-threading somehow being ignored?

    Best wishes,

    Bernt

    ---

    Command line

    $gatk --java-options "-Xmx100g -Djava.io.tmpdir=$TMPDIR" \
    $walker \
    --genomicsdb-workspace-path $db \
    --genomicsdb-shared-posixfs-optimizations \
    --sample-name-map cattle.sample_map2 \
    --batch-size 50 \
    -L $chr:$start-$end \
    --tmp-dir $TMPDIR \
    --reader-threads 12 > $chr/import_${chr}_${start}_${end}.log 2>&1

  • Nalini Ganapati

    Hi Bernt,

From the GATK docs for GenomicsDBImport, --reader-threads is for reading VCFs only. For the import itself, you will have to use the --max-num-intervals-to-import-in-parallel option. Of course, if you have only one interval, it does not matter; the default is 1.

--max-num-intervals-to-import-in-parallel:Integer
    Max number of intervals to import in parallel; higher values may improve performance, but
    require more memory and a higher number of file descriptors open at the same time. Default
    value: 1.

--reader-threads:Integer
    How many simultaneous threads to use when opening VCFs in batches; higher values may
    improve performance when network latency is an issue. Multiple reader threads are not
    supported when running with multiple intervals. Default value: 1.
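For illustration, a two-interval import using that option might look like the following (all paths, intervals, and the memory setting are placeholders; the command is only printed here, not run):

```shell
# Compose (not run) an import that parallelizes over two intervals
# instead of over VCF readers. All values are placeholders.
cmd="gatk --java-options -Xmx50g GenomicsDBImport \
 --genomicsdb-workspace-path Cattle_chr1 \
 --sample-name-map cattle.sample_map2 \
 --batch-size 50 \
 -L chr1:1-5000000 -L chr1:5000001-10000000 \
 --max-num-intervals-to-import-in-parallel 2 \
 --reader-threads 1"
echo "$cmd"
```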

I have a few questions:

1. How much physical memory do you have in addition to the Java option of -Xmx100G? GenomicsDBImport uses a native C/C++ library that can get starved if most of the physical memory is reserved for the Java VM.
    2. > With version 4.1.8.1 and the new option added, the import proceeds at about 1.5 MB per second running against the NFS file server.
      You mention earlier that on an internal SSD, the import proceeds at 2.3 MB/s. Is this acceptable performance for you? How are you measuring the performance?
    3. How large is your tmpdir?
    4. How many intervals are you currently importing?
