Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

How do I SelectVariants from GenomicsDB stored in GCS?

29 comments

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    Yes, you run SelectVariants this way. For a GenomicsDB workspace in a bucket, you need to use the gendb.gs:// prefix to access it. It should work if you run it that way.

    Genevieve
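
A minimal sketch of the prefix rule, assuming a POSIX shell. The function name genomicsdb_uri is hypothetical and not part of GATK; it only rewrites a workspace location into the form passed to -V (gendb.gs:// for a bucket, gendb:// for a local directory, the two forms that appear in this thread). The example prints the resulting SelectVariants command rather than running it.

```shell
# Hypothetical helper (not a GATK command): convert a workspace location
# into the URI form that is passed to GATK's -V argument.
genomicsdb_uri() {
  case "$1" in
    gs://*) printf 'gendb.gs://%s\n' "${1#gs://}" ;;  # workspace in a GCS bucket
    *)      printf 'gendb://%s\n' "$1" ;;             # workspace on local disk
  esac
}

# Dry run: show the command that would be executed.
echo gatk SelectVariants \
  -R Homo_sapiens_assembly38.fasta \
  -V "$(genomicsdb_uri gs://mybucket/genomicsdb)" \
  -L chr20:1-1000000 \
  -O test.vcf.gz
```

Dropping the leading echo would run the command for real; the bucket and file names here are placeholders.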

  • Lucas Taniguti

    Thank you @Genevieve Brandt, it started to work with gendb.gs://

    But now it does not seem to finish. I have only one sample stored in the database, I'm selecting only chr20:1-1000000, and it has been running for more than 30 minutes. Is this expected?

    I'm using a VM from GCE, in the same region as the GCS bucket.

    Using GATK jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
    Running:
       java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -Xms5g -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb.gs://mybucket/genomicsdb -L chr20:1-1000000 -O teste.vcf.gz
    23:01:23.595 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    23:01:23.914 INFO  SelectVariants - ------------------------------------------------------------
    23:01:23.915 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
    23:01:23.915 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    23:01:23.918 INFO  SelectVariants - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1036-gcp amd64
    23:01:23.918 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.20.04
    23:01:23.919 INFO  SelectVariants - Start Date/Time: February 1, 2021 at 11:01:23 PM UTC
    23:01:23.919 INFO  SelectVariants - ------------------------------------------------------------
    23:01:23.919 INFO  SelectVariants - ------------------------------------------------------------
    23:01:23.928 INFO  SelectVariants - HTSJDK Version: 2.23.0
    23:01:23.929 INFO  SelectVariants - Picard Version: 2.23.3
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    23:01:23.930 INFO  SelectVariants - Deflater: IntelDeflater
    23:01:23.930 INFO  SelectVariants - Inflater: IntelInflater
    23:01:23.930 INFO  SelectVariants - GCS max retries/reopens: 20
    23:01:23.930 INFO  SelectVariants - Requester pays: disabled
    23:01:23.930 INFO  SelectVariants - Initializing engine
    23:01:25.939 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    23:01:39.847 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field AS_InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.847 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field AS_QD  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field DS  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:51.886 INFO  IntervalArgumentCollection - Processing 1000000 bp from intervals
    23:01:51.918 INFO  SelectVariants - Done initializing engine
    23:01:52.050 INFO  ProgressMeter - Starting traversal
    23:01:52.051 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute

     

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti, do you have any updates about this issue? Were you able to get it to work since you posted?

  • Lucas Taniguti

    Hi Genevieve Brandt, after 1 h of waiting I decided to kill the process. When running it again I noticed an error from TileDB during the GenomicsDBImport step. Here are the GATK logs:

    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -Xms5g -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar GenomicsDBImport --batch-size 50 --genomicsdb-workspace-path gs://mybucket/genomicsdb4 -L chr20 -V gvcfs/ABL704-002.g.vcf.gz
    21:10:31.245 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    21:10:31.589 INFO GenomicsDBImport - ------------------------------------------------------------
    21:10:31.589 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.9.0
    21:10:31.590 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    21:10:31.590 INFO GenomicsDBImport - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1036-gcp amd64
    21:10:31.590 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.20.04
    21:10:31.590 INFO GenomicsDBImport - Start Date/Time: February 2, 2021 at 9:10:31 PM UTC
    21:10:31.591 INFO GenomicsDBImport - ------------------------------------------------------------
    21:10:31.591 INFO GenomicsDBImport - ------------------------------------------------------------
    21:10:31.592 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
    21:10:31.592 INFO GenomicsDBImport - Picard Version: 2.23.3
    21:10:31.593 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    21:10:31.593 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    21:10:31.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    21:10:31.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    21:10:31.604 INFO GenomicsDBImport - Deflater: IntelDeflater
    21:10:31.604 INFO GenomicsDBImport - Inflater: IntelInflater
    21:10:31.604 INFO GenomicsDBImport - GCS max retries/reopens: 20
    21:10:31.604 INFO GenomicsDBImport - Requester pays: disabled
    21:10:31.605 INFO GenomicsDBImport - Initializing engine
    21:10:32.172 INFO IntervalArgumentCollection - Processing 64444167 bp from intervals
    21:10:32.175 INFO GenomicsDBImport - Done initializing engine
    21:10:32.688 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    21:10:36.052 INFO GenomicsDBImport - Vid Map JSON file will be written to gs://mybucket/genomicsdb4/vidmap.json
    21:10:36.053 INFO GenomicsDBImport - Callset Map JSON file will be written to gs://mybucket/genomicsdb4/callset.json
    21:10:36.053 INFO GenomicsDBImport - Complete VCF Header will be written to gs://mybucket/genomicsdb4/vcfheader.vcf
    21:10:36.053 INFO GenomicsDBImport - Importing to workspace - gs://mybucket/genomicsdb4
    21:10:36.053 INFO ProgressMeter - Starting traversal
    21:10:36.054 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    21:10:40.520 INFO GenomicsDBImport - Importing batch 1 with 1 samples
    [TileDB::FileSystem] Error: hdfs: Cannot list contents of dir gs://mybucket/genomicsdb4/chr20$1$64444167/genomicsdb_meta_dir
    21:11:06.646 INFO ProgressMeter - chr20:1 0.5 1 2.0
    21:11:06.646 INFO GenomicsDBImport - Done importing batch 1/1
    21:11:06.647 INFO ProgressMeter - chr20:1 0.5 1 2.0
    21:11:06.647 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.5 minutes.
    21:11:06.647 INFO GenomicsDBImport - Import of all batches to GenomicsDB completed!
    21:11:06.647 INFO GenomicsDBImport - Shutting down engine
    [February 2, 2021 at 9:11:06 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.59 minutes.
    Runtime.totalMemory()=5372903424
    Tool returned:
    true

    Is this expected? I now suspect that SelectVariants is failing (i.e., taking too long) because of this.

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    We cannot tell at this point whether this error is causing the performance issue.

    First, could you check if the issue is with the GenomicsDBImport workspace? Download the directory created by GenomicsDBImport and run SelectVariants locally. If it fails, there might be an issue with the workspace.

    Second, is there a chance that SelectVariants may take longer than you are expecting? Maybe wait a day or so to see if it completes. What is the size of your GenomicsDB? If it is large, then you may be cutting it off before it can finish.

    Best,

    Genevieve

  • Lucas Taniguti

    Hi Genevieve Brandt,

    When I tested with the GenomicsDB workspace locally, it took 0.15 minutes, very fast!

    My test dataset is a single-sample exome (g.vcf.gz), selecting only chromosome 20 (98576 positions). I've tried both SelectVariants and GenotypeGVCFs.

    Do you think it is necessary to leave it running for a day even with such a small dataset? Here is how I executed what you proposed:

    # Download locally
    gsutil cp -r gs://genomicsdb-test/db .
    # Execute SelectVariants
    ./gatk-4.1.9.0/gatk --java-options "-Xmx7g -Xms5g" SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb://db  -L chr20 -O test.vcf.gz

    (...)
    23:05:43.863 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
    23:05:43.931 info  NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field AS_InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    23:05:43.931 info  NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field AS_QD  - the field will NOT be part of INFO fields in the generated VCF records
    23:05:43.932 info  NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field DS  - the field will NOT be part of INFO fields in the generated VCF records
    23:05:43.932 info  NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    23:05:43.932 info  NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
    23:05:43.932 info  NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
    23:05:44.262 INFO  IntervalArgumentCollection - Processing 64444167 bp from intervals
    23:05:44.304 INFO  SelectVariants - Done initializing engine
    23:05:44.495 INFO  ProgressMeter - Starting traversal
    23:05:44.496 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),2.764708809999966,Cpu time(s),2.5340146019999557
    23:05:51.449 INFO  ProgressMeter -       chr20:64048842              0.1                 98576         850771.0
    23:05:51.452 INFO  ProgressMeter - Traversal complete. Processed 98576 total variants in 0.1 minutes.
    23:05:51.464 INFO  SelectVariants - Shutting down engine
    [February 3, 2021 at 11:05:51 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.15 minutes.
    Runtime.totalMemory()=5372903424

    Best,

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    How large is your GenomicsDB directory?

    I have created a ticket on our GitHub to look into this issue further; you can follow along here: https://github.com/broadinstitute/gatk/issues/7070

    Best,

    Genevieve

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    I followed up again with my team, and we were wondering if you could try some steps so that we can diagnose where the problem is occurring. Could you run jstack with the process ID while the command is running and see what is going on? Do it a few different times and post what you see here. We want to know whether the process is hung and not progressing, or just very slow.

    Thank you,

    Genevieve
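
A minimal sketch of the repeated jstack capture requested above, assuming a POSIX shell and a JDK whose jstack is on the PATH. The function name capture_thread_dumps, the output file names, and the optional fifth argument are hypothetical; the fifth argument exists only so the loop can be tried without a running JVM.

```shell
# Hypothetical helper: save COUNT thread dumps of process PID, taken
# INTERVAL seconds apart, into OUTDIR as dump_1.txt, dump_2.txt, ...
# The dump command defaults to the JDK's jstack.
capture_thread_dumps() {
  pid="$1"; count="$2"; interval="$3"; outdir="$4"; dump_cmd="${5:-jstack}"
  mkdir -p "$outdir" || return 1
  i=1
  while [ "$i" -le "$count" ]; do
    # Keep stderr too, so a failed dump is visible in the saved file.
    "$dump_cmd" "$pid" > "$outdir/dump_$i.txt" 2>&1
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}

# Example (PID 4376 is taken from the log above; use the PID of your own run):
#   capture_thread_dumps 4376 5 30 jstack_dumps
```

If the main thread shows the same stack frames in every dump, the process is probably hung; if the frames change between dumps, it is more likely progressing slowly.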

  • Lucas Taniguti

    Hi Genevieve Brandt, thank you for the help.

    Here are the jstack outputs (.tar.gz): https://drive.google.com/file/d/1OCfsGpzYxY6LlPkOI-ECSVrFgfMRE-WJ/view

    New note: when I store two samples with a single command (GenomicsDBImport with --genomicsdb-workspace-path), GenotypeGVCFs completes successfully in 14 minutes, but when I store one and later add the other (using --genomicsdb-update-workspace-path), the GenotypeGVCFs process seems to hang (see the logs in the .tar.gz).

    The resulting GenomicsDB is 2.6 MB. For testing purposes I stored only chr20 of two exomes. Locally, GenotypeGVCFs completes in less than one minute.

    Note: for this new test I was using a VM with 2 CPUs and 8GB of memory.

  • Genevieve Brandt (she/her)

    Lucas Taniguti just wanted to let you know we are still looking into this and will get back to you when we have updates.

  • Lucas Taniguti

    Thank you, Genevieve Brandt. Please let me know if you need more details.

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    For the new note you gave above where the SelectVariants command only has the issue when you have updated the GenomicsDB, do you see anything like this when running locally? 

    Best,

    Genevieve

     

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    Our team wanted to follow up and see if we could get this problem resolved. Do you have any updates on your end?

    Would you be open to trying a debuggable version that we could send to you to run? This could help us to determine what is going wrong.

    Also, could you send in your GenomicsDB workspace as a bug report? We can then try it on our end and see if we can figure out the issue.

    Thank you,

    Genevieve

  • Lucas Taniguti

    Hi Genevieve Brandt, sorry for answering so late. I do not have updates.

    This week I'll try to prepare a working example on how to reproduce my issue on your end.

    Thank you.

  • Genevieve Brandt (she/her)

    Great, thank you!

  • Lucas Taniguti

    Hi Genevieve Brandt,

    Here is what I'm trying, step by step. Please let me know if more info is required.

    Inputs

    - Sample 1 g.vcf
    - Sample 2 g.vcf
    - Human reference genome (gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta)

    The g.vcf files used and the resulting databases are available on Google Drive:
    https://drive.google.com/file/d/1rq5p5YpDYY6n0OqTL05IaDAl8vlMhxNv/view?usp=sharing

    Using locally (works)

    SAMPLE1=gvcfs/sample-1-chr20.g.vcf.gz
    SAMPLE2=gvcfs/sample-2-chr20.g.vcf.gz
    DB=my-local-database
    GATK=../gatk-4.1.9.0/gatk

    $GATK --java-options "-Xmx10g -Xms5g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path $DB \
    -L chr20 \
    -V $SAMPLE1

    # [March 14, 2021 at 12:59:13 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.07 minutes.
    # Runtime.totalMemory()=5372903424
    # Tool returned:
    # true

    $GATK --java-options "-Xmx10g -Xms5g" \
    GenomicsDBImport \
    --genomicsdb-update-workspace-path $DB \
    -L chr20 \
    -V $SAMPLE2

    # [March 14, 2021 at 12:59:42 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.07 minutes.
    # Runtime.totalMemory()=5372903424
    # Tool returned:
    # true

    $GATK --java-options "-Xmx10g -Xms5g" \
    SelectVariants \
    -R ../Homo_sapiens_assembly38.fasta \
    -V gendb://$DB \
    -L chr20 -O test.vcf.gz

    # [March 14, 2021 at 1:00:20 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.10 minutes.
    # Runtime.totalMemory()=5372903424


    Using Google Cloud Storage (still not working)

    SAMPLE1=gvcfs/sample-1-chr20.g.vcf.gz
    SAMPLE2=gvcfs/sample-2-chr20.g.vcf.gz
    DB=genomicsdb-test/my-gcs-database
    export GOOGLE_APPLICATION_CREDENTIALS=SA-secret.json

    $GATK --java-options "-Xmx10g -Xms5g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path gs://$DB \
    -L chr20 \
    -V $SAMPLE1

    # [March 14, 2021 at 12:42:51 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.60 minutes.
    # Runtime.totalMemory()=5372903424
    # Tool returned:
    # true

    $GATK --java-options "-Xmx10g -Xms5g" \
    GenomicsDBImport \
    --genomicsdb-update-workspace-path gs://$DB \
    -L chr20 \
    -V $SAMPLE2

    # [March 14, 2021 at 12:44:24 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.55 minutes.
    # Runtime.totalMemory()=5372903424
    # Tool returned:
    # true

    $GATK --java-options "-Xmx10g -Xms5g" \
    SelectVariants \
    -R ../Homo_sapiens_assembly38.fasta \
    -V gendb.gs://$DB \
    -L chr20 -O test2.vcf.gz

    # 12:46:02.138 INFO SelectVariants - Done initializing engine
    # 12:46:02.306 INFO ProgressMeter - Starting traversal
    # 12:46:02.307 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    # ---and hangs here

    Database sizes:

    gsutil du -sh gs://genomicsdb-test/my-gcs-database
    # 1.31 MiB gs://genomicsdb-test/my-gcs-database

    du -sh my-local-database/
    # 1.6M my-local-database/

    If I copy my-gcs-database to my local filesystem it works:

    gsutil cp -r gs://genomicsdb-test/my-gcs-database .
    $GATK --java-options "-Xmx10g -Xms5g" SelectVariants \
    -R ../Homo_sapiens_assembly38.fasta \
    -V gendb://my-gcs-database -L chr20 \
    -O test2.vcf.gz

    # [March 14, 2021 at 1:02:36 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.10 minutes.
    # Runtime.totalMemory()=5372903424

    du -sh my-local-database my-gcs-database
    # 1.6M my-local-database
    # 1.6M my-gcs-database

     

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    Thank you for the update, our team is looking into this.

    Genevieve

  • Genevieve Brandt (she/her)

    Hi Lucas Taniguti,

    The dev team is continuing to work on this issue and will post updates on the github issue ticket. Please stay tuned to this link for updates: https://github.com/broadinstitute/gatk/issues/7070

    Thank you,

    Genevieve

  • Nalini Ganapati

    Hello Lucas Taniguti,

    I have a solution for the GCS issue. Could you quickly try it out if I send you the native GenomicsDB library with instructions for running it with gatk and java-options? To build the library, I will need the OS flavor (Ubuntu or Linux) and version you are running.

  • Lucas Taniguti

    Hello Nalini Ganapati, yes, I can try it out. Here is the information about the system I'm running:

     

    ~$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 20.04.1 LTS
    Release: 20.04
    Codename: focal

    ~$ uname -a
    Linux phasing-shapeit4-taniguti 5.4.0-1040-gcp #43-Ubuntu SMP Fri Mar 19 17:49:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

    Thank you.

  • Nalini Ganapati

    Here is the link to the genomicsdb library. After downloading it, here is a sample command that invokes gatk and points it to the directory containing the library:

    ./gatk-4.1.9.0/gatk --java-options "-Dgenomicsdb.library.path=<library_dir>" SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb.gs://my_bucket/my-gcs-database -L chr20 -O test.vcf.gz

    Please let me know if it works or if I have to fine tune the Google Cloud Storage credentials and other GCS settings further.
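
A minimal sketch, assuming a POSIX shell, of how the -Dgenomicsdb.library.path option could be assembled safely. The function name genomicsdb_library_opt is hypothetical; the file name libtiledbgenomicsdb.so is the one GATK reports loading in the log later in this thread. The check catches a wrong directory before a long run starts.

```shell
# Hypothetical helper: print the JVM option that points GATK at a custom
# GenomicsDB native library, but only if the library file is present in
# the given directory.
genomicsdb_library_opt() {
  if [ ! -f "$1/libtiledbgenomicsdb.so" ]; then
    echo "libtiledbgenomicsdb.so not found in $1" >&2
    return 1
  fi
  printf '%s\n' "-Dgenomicsdb.library.path=$1"
}

# Example (the directory is a placeholder):
#   gatk --java-options "-Xmx10g $(genomicsdb_library_opt /path/to/library_dir)" \
#     SelectVariants -R Homo_sapiens_assembly38.fasta \
#     -V gendb.gs://my_bucket/my-gcs-database -L chr20 -O test.vcf.gz
```

Once GATK starts, the "GenomicsDB native library has been loaded from" log line confirms which copy of the library was actually picked up.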

  • Lucas Taniguti

    Now I get a "fatal error":

    $ GOOGLE_APPLICATION_CREDENTIALS=/home/my/secret.json $GATK --java-options "-Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom" SelectVariants -R ../Homo_sapiens_assembly38.fasta -V gendb.gs://$DB -L chr20 -O test.vcf.gz
    Using GATK jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar SelectVariants -R ../Homo_sapiens_assembly38.fasta -V gendb.gs://genomicsdb-test/my-gcs-database2 -L chr20 -O test.vcf.gz
    17:55:18.447 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    17:55:18.729 INFO SelectVariants - ------------------------------------------------------------
    17:55:18.730 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
    17:55:18.730 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    17:55:18.730 INFO SelectVariants - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1040-gcp amd64
    17:55:18.731 INFO SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.10+9-Ubuntu-0ubuntu1.20.04
    17:55:18.731 INFO SelectVariants - Start Date/Time: March 29, 2021 at 5:55:18 PM UTC
    17:55:18.731 INFO SelectVariants - ------------------------------------------------------------
    17:55:18.731 INFO SelectVariants - ------------------------------------------------------------
    17:55:18.732 INFO SelectVariants - HTSJDK Version: 2.23.0
    17:55:18.733 INFO SelectVariants - Picard Version: 2.23.3
    17:55:18.733 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    17:55:18.733 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    17:55:18.733 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    17:55:18.733 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    17:55:18.734 INFO SelectVariants - Deflater: IntelDeflater
    17:55:18.734 INFO SelectVariants - Inflater: IntelInflater
    17:55:18.734 INFO SelectVariants - GCS max retries/reopens: 20
    17:55:18.734 INFO SelectVariants - Requester pays: disabled
    17:55:18.735 INFO SelectVariants - Initializing engine
    17:55:20.469 INFO GenomicsDBLibLoader - GenomicsDB native library has been loaded from /home/taniguti/genomicsdb/custom/libtiledbgenomicsdb.so
    17:55:20.469 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.0-SNAPSHOT-06362e1
    [TileDB::StorageManagerConfig] Error: GCS FS only supports already existing buckets. Failed to locate bucket=genomicsdb-test Permanent error in GetBucketMetadata: EasyPerform() - CURL error [77]=Problem with the SSL CA cert (path? access rights?) [UNKNOWN]: Input/output error.
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x00007efc796f15e1, pid=8194, tid=8195
    #
    # JRE version: OpenJDK Runtime Environment (11.0.10+9) (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
    # Java VM: OpenJDK 64-Bit Server VM (11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
    # Problematic frame:
    # C [libtiledbgenomicsdb.so+0x5cf5e1] StorageManagerConfig::~StorageManagerConfig()+0x21
    #
    # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/taniguti/genomicsdb/core.8194)
    #
    # An error report file with more information is saved as:
    # /home/taniguti/genomicsdb/hs_err_pid8194.log
    #
    # If you would like to submit a bug report, please visit:
    # https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #

    The message:
    > Failed to locate bucket=genomicsdb-test

    suggests that the bucket does not exist, but it does exist and my service account has the admin role.

  • Nalini Ganapati

    Lucas Taniguti, I have placed another library here. This library tries a few well-known paths on Linux machines. You could try it if you like; there will be some debugging statements like `CA Certs path=/etc/pki/tls/certs/ca-bundle.crt`. Please let me know if this fixes your issue.
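
A minimal diagnostic sketch, assuming a POSIX shell, for the CURL error 77 (problem with the SSL CA cert) reported above. It checks which of two common CA bundle locations exists on the machine: /etc/ssl/certs/ca-certificates.crt is the usual Debian/Ubuntu location, and /etc/pki/tls/certs/ca-bundle.crt is the RHEL-style path shown in the debugging statement. The function name first_readable_file is hypothetical, and the sketch only reports a path; it does not change how GenomicsDB looks for certificates.

```shell
# Hypothetical helper: print the first argument that is a readable regular
# file; return non-zero if none of the candidates qualifies.
first_readable_file() {
  for candidate in "$@"; do
    if [ -f "$candidate" ] && [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Debian/Ubuntu location first, then the RHEL-style location.
first_readable_file \
  /etc/ssl/certs/ca-certificates.crt \
  /etc/pki/tls/certs/ca-bundle.crt \
  || echo "no CA bundle found at the usual locations" >&2
```

Reporting which path is found, together with the OS version, may help whoever builds the native library choose the right default.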

  • Nalini Ganapati

    After downloading the new library, please run your command as before:

    GOOGLE_APPLICATION_CREDENTIALS=/home/my/secret.json $GATK --java-options "-Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom" SelectVariants -R ../Homo_sapiens_assembly38.fasta -V gendb.gs://$DB -L chr20 -O test.vcf.gz

  • Lucas Taniguti

    Nalini Ganapati, it's working now, thank you!

    I tested it with the subset I sent earlier, and now the "mini-workflow" finishes as expected.


    $GATK --java-options "-Xmx10g -Xms5g -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom2" \
      GenomicsDBImport \
      --genomicsdb-workspace-path gs://$DB \
      -L chr20 \
      -V $SAMPLE1

    # 11:20:50.658 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.3 minutes.
    # 11:20:50.658 INFO GenomicsDBImport - Import completed!
    # 11:20:50.658 INFO GenomicsDBImport - Shutting down engine
    # [April 2, 2021 at 11:20:50 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.36 minutes.
    # Runtime.totalMemory()=5372903424
    # Tool returned:
    # true

    $GATK --java-options "-Xmx10g -Xms5g -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom2" \
    GenomicsDBImport \
    --genomicsdb-update-workspace-path gs://$DB \
    -L chr20 \
    -V $SAMPLE2

    # 11:22:06.549 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.4 minutes.
    # 11:22:06.549 INFO GenomicsDBImport - Import completed!
    # 11:22:06.549 INFO GenomicsDBImport - Shutting down engine
    # [April 2, 2021 at 11:22:06 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.43 minutes.
    # Runtime.totalMemory()=5372903424
    # Tool returned:
    # true

    $GATK --java-options "-Xmx10g -Xms5g -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom2" \
    SelectVariants \
    -R ../Homo_sapiens_assembly38.fasta \
    -V gendb.gs://$DB \
    -L chr20 -O test-final.vcf.gz

    # 11:23:08.145 INFO ProgressMeter - chr20:1375618 0.4 1000 2376.4
    # GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),1.0103075140000013,Cpu time(s),0.8786914609999955
    # 11:23:10.226 INFO ProgressMeter - chr20:64084138 0.5 26128 57365.3
    # 11:23:10.226 INFO ProgressMeter - Traversal complete. Processed 26128 total variants in 0.5 minutes.
    # 11:23:10.241 INFO SelectVariants - Shutting down engine
    # [April 2, 2021 at 11:23:10 AM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.72 minutes.
    # Runtime.totalMemory()=5372903424


    Now I'll try it with larger datasets.

     

  • Nalini Ganapati

    Thanks Lucas Taniguti. Let me know how it performs with larger datasets.

  • Lucas Taniguti

    OK, Nalini Ganapati. It's not done yet, but I'll post any updates here.

  • anr

    Lucas Taniguti, Nalini Ganapati: I'm trying to use GenomicsDBImport on 238 WGS samples sequenced to 30X, on all intervals, on Google Cloud. Is there anything that I should keep in mind when implementing this? As far as memory and disk go, what would be the optimal settings to request, based on your experience? Any insights are greatly appreciated. Thank you.

     

  • Nalini Ganapati

    Are the VCFs on Google Cloud? Do you want to create the GenomicsDB workspace on Google Cloud or locally? Also, could you open a new forum post, as this deals with GenomicsDBImport and not SelectVariants?
