How do I SelectVariants from GenomicsDB stored in GCS?
Hi,
I'm trying to use a GenomicsDB workspace stored in Google Cloud Storage as the main "database" for our variants. Importing with GenomicsDBImport seems easy, as follows:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/my/serviceaccount.json \
gatk-4.1.9.0/gatk --java-options "-Xmx10g -Xms5g" \
GenomicsDBImport --batch-size 50 \
--genomicsdb-update-workspace-path gs://my-bucket/genomicdb \
-L chr20 -V gvcfs/SAMPLE.g.vcf.gz
**Note:** I'm using `--genomicsdb-update-workspace-path` because I had already created the workspace with `--genomicsdb-workspace-path`.
No problems so far. The objects were created successfully:
gs://my-bucket/genomicdb/
gs://my-bucket/genomicdb/__tiledb_workspace.tdb
gs://my-bucket/genomicdb/callset.json
gs://my-bucket/genomicdb/vcfheader.vcf
gs://my-bucket/genomicdb/vidmap.json
gs://my-bucket/genomicdb/chr20$1$64444167/
Is it possible to use SelectVariants or GenotypeGVCFs directly on this database stored in Google Cloud Storage?
I've already tried:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/my/serviceaccount.json \
gatk-4.1.9.0/gatk --java-options "-Xmx10g -Xms5g" \
SelectVariants -R Homo_sapiens_assembly38.fasta \
-V gs://my-bucket/genomicdb \
-L chr20 \
-O test.vcf.gz
But I think the program is looking for a VCF file in GCS rather than a GenomicsDB workspace. This is the error message:
Using GATK jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -Xms5g -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar SelectVariants -R Homo_sapiens_assembly38.fasta -V gs://my-bucket/genomicdb -L chr20 -O test.vcf.gz
12:49:45.169 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:49:45.518 INFO SelectVariants - ------------------------------------------------------------
12:49:45.519 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
12:49:45.519 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
12:49:45.519 INFO SelectVariants - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1036-gcp amd64
12:49:45.519 INFO SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.20.04
12:49:45.520 INFO SelectVariants - Start Date/Time: January 31, 2021 at 12:49:45 PM UTC
12:49:45.520 INFO SelectVariants - ------------------------------------------------------------
12:49:45.520 INFO SelectVariants - ------------------------------------------------------------
12:49:45.521 INFO SelectVariants - HTSJDK Version: 2.23.0
12:49:45.521 INFO SelectVariants - Picard Version: 2.23.3
12:49:45.521 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:49:45.521 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:49:45.521 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:49:45.521 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:49:45.521 INFO SelectVariants - Deflater: IntelDeflater
12:49:45.521 INFO SelectVariants - Inflater: IntelInflater
12:49:45.522 INFO SelectVariants - GCS max retries/reopens: 20
12:49:45.522 INFO SelectVariants - Requester pays: disabled
12:49:45.522 INFO SelectVariants - Initializing engine
12:49:47.076 INFO SelectVariants - Shutting down engine
[January 31, 2021 at 12:49:47 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=5372903424
***********************************************************************
A USER ERROR has occurred: Couldn't read file gs://my-bucket/genomicdb. Error was: It isn't a regular file
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
The bucket does not have "Requester pays" activated, but I also tried the parameter `--gcs-project-for-requester-pays` pointing to my cloud project name. No success.
Is it possible to run SelectVariants/GenotypeGVCFs this way? If so, how?
Thanks!
-
Hi Lucas Taniguti,
Yes, you can run SelectVariants this way. For a GenomicsDB workspace in a bucket, you need to use the gendb.gs:// prefix to access it. It should work if you run it that way.
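As an illustration of the prefix rule, here is a minimal shell sketch. The helper name `to_gendb_uri` is hypothetical (it is not part of GATK); it only rewrites a workspace location into the form passed to `-V`: `gendb.gs://` for a workspace in a bucket and `gendb://` for a local directory, the two forms that appear in this thread.

```shell
# Hypothetical helper (not part of GATK): turn a workspace location into
# the URI form passed to -V.
to_gendb_uri() {
  case "$1" in
    gs://*) printf 'gendb.%s\n' "$1" ;;    # bucket workspace: gendb.gs://bucket/path
    *)      printf 'gendb://%s\n' "$1" ;;  # local workspace:  gendb://path
  esac
}

# Example use (bucket name taken from the question above):
#   gatk SelectVariants -R Homo_sapiens_assembly38.fasta \
#     -V "$(to_gendb_uri gs://my-bucket/genomicdb)" -L chr20 -O test.vcf.gz
```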
Genevieve
-
Thank you @Genevieve Brandt, it has started to work with gendb.gs://
But now I think it does not finish. I have only one sample stored in the database and I'm selecting only chr20:1-1000000, yet it has been running for more than 30 minutes. Is this expected?
I'm using a VM from GCE, in the same region as the GCS bucket.
Using GATK jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -Xms5g -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb.gs://mybucket/genomicsdb -L chr20:1-1000000 -O teste.vcf.gz
23:01:23.595 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:01:23.914 INFO SelectVariants - ------------------------------------------------------------
23:01:23.915 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
23:01:23.915 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
23:01:23.918 INFO SelectVariants - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1036-gcp amd64
23:01:23.918 INFO SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.20.04
23:01:23.919 INFO SelectVariants - Start Date/Time: February 1, 2021 at 11:01:23 PM UTC
23:01:23.919 INFO SelectVariants - ------------------------------------------------------------
23:01:23.919 INFO SelectVariants - ------------------------------------------------------------
23:01:23.928 INFO SelectVariants - HTSJDK Version: 2.23.0
23:01:23.929 INFO SelectVariants - Picard Version: 2.23.3
23:01:23.929 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:01:23.929 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:01:23.929 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:01:23.929 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:01:23.930 INFO SelectVariants - Deflater: IntelDeflater
23:01:23.930 INFO SelectVariants - Inflater: IntelInflater
23:01:23.930 INFO SelectVariants - GCS max retries/reopens: 20
23:01:23.930 INFO SelectVariants - Requester pays: disabled
23:01:23.930 INFO SelectVariants - Initializing engine
23:01:25.939 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
23:01:39.847 info NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field AS_InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
23:01:39.847 info NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field AS_QD - the field will NOT be part of INFO fields in the generated VCF records
23:01:39.848 info NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
23:01:39.848 info NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
23:01:39.848 info NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
23:01:39.848 info NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
23:01:51.886 INFO IntervalArgumentCollection - Processing 1000000 bp from intervals
23:01:51.918 INFO SelectVariants - Done initializing engine
23:01:52.050 INFO ProgressMeter - Starting traversal
23:01:52.051 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
-
Hi Lucas Taniguti, do you have any updates about this issue? Were you able to get it to work since you posted?
-
Hi Genevieve-Brandt-she-her, after 1 h of waiting I decided to kill the process. When running it again I noticed an error from TileDB during the GenomicsDBImport step. Here are the gatk logs:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -Xms5g -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar GenomicsDBImport --batch-size 50 --genomicsdb-workspace-path gs://mybucket/genomicsdb4 -L chr20 -V gvcfs/ABL704-002.g.vcf.gz
21:10:31.245 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
21:10:31.589 INFO GenomicsDBImport - ------------------------------------------------------------
21:10:31.589 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.9.0
21:10:31.590 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
21:10:31.590 INFO GenomicsDBImport - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1036-gcp amd64
21:10:31.590 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.20.04
21:10:31.590 INFO GenomicsDBImport - Start Date/Time: February 2, 2021 at 9:10:31 PM UTC
21:10:31.591 INFO GenomicsDBImport - ------------------------------------------------------------
21:10:31.591 INFO GenomicsDBImport - ------------------------------------------------------------
21:10:31.592 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
21:10:31.592 INFO GenomicsDBImport - Picard Version: 2.23.3
21:10:31.593 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:10:31.593 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:10:31.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:10:31.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:10:31.604 INFO GenomicsDBImport - Deflater: IntelDeflater
21:10:31.604 INFO GenomicsDBImport - Inflater: IntelInflater
21:10:31.604 INFO GenomicsDBImport - GCS max retries/reopens: 20
21:10:31.604 INFO GenomicsDBImport - Requester pays: disabled
21:10:31.605 INFO GenomicsDBImport - Initializing engine
21:10:32.172 INFO IntervalArgumentCollection - Processing 64444167 bp from intervals
21:10:32.175 INFO GenomicsDBImport - Done initializing engine
21:10:32.688 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
21:10:36.052 INFO GenomicsDBImport - Vid Map JSON file will be written to gs://mybucket/genomicsdb4/vidmap.json
21:10:36.053 INFO GenomicsDBImport - Callset Map JSON file will be written to gs://mybucket/genomicsdb4/callset.json
21:10:36.053 INFO GenomicsDBImport - Complete VCF Header will be written to gs://mybucket/genomicsdb4/vcfheader.vcf
21:10:36.053 INFO GenomicsDBImport - Importing to workspace - gs://mybucket/genomicsdb4
21:10:36.053 INFO ProgressMeter - Starting traversal
21:10:36.054 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
21:10:40.520 INFO GenomicsDBImport - Importing batch 1 with 1 samples
[TileDB::FileSystem] Error: hdfs: Cannot list contents of dir gs://mybucket/genomicsdb4/chr20$1$64444167/genomicsdb_meta_dir
21:11:06.646 INFO ProgressMeter - chr20:1 0.5 1 2.0
21:11:06.646 INFO GenomicsDBImport - Done importing batch 1/1
21:11:06.647 INFO ProgressMeter - chr20:1 0.5 1 2.0
21:11:06.647 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.5 minutes.
21:11:06.647 INFO GenomicsDBImport - Import of all batches to GenomicsDB completed!
21:11:06.647 INFO GenomicsDBImport - Shutting down engine
[February 2, 2021 at 9:11:06 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.59 minutes.
Runtime.totalMemory()=5372903424
Tool returned:
true
Is this expected? I now suspect that SelectVariants is failing (i.e., taking too long) because of this.
-
Hi Lucas Taniguti,
We cannot tell at this point whether this error is causing the performance issue.
First, could you check if the issue is with the GenomicsDBImport workspace? Download the directory created by GenomicsDBImport and run SelectVariants locally. If it fails, there might be an issue with the workspace.
Second, is there a chance that SelectVariants may take longer than you are expecting? Maybe wait a day or so to see if it completes. What is the size of your GenomicsDB? If it is large, then you may be cutting it off before it can finish.
Best,
Genevieve
-
Hi Genevieve-Brandt-she-her,
When I tested using the GenomicsDB workspace locally, it took 0.15 minutes, very fast!
My test dataset is single-sample exome data (g.vcf.gz), restricted to chromosome 20 (98576 positions). I've tried both SelectVariants and GenotypeGVCFs. Do you think it is necessary to leave it running for a day even with such a small dataset? Here is how I ran what you proposed:
# Download locally
gsutil cp -r gs://genomicsdb-test/db .
# Execute SelectVariants
./gatk-4.1.9.0/gatk --java-options "-Xmx7g -Xms5g" SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb://db -L chr20 -O test.vcf.gz
(...)
23:05:43.863 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
23:05:43.931 info NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field AS_InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
23:05:43.931 info NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field AS_QD - the field will NOT be part of INFO fields in the generated VCF records
23:05:43.932 info NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
23:05:43.932 info NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
23:05:43.932 info NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
23:05:43.932 info NativeGenomicsDB - pid=5755 tid=5756 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
23:05:44.262 INFO IntervalArgumentCollection - Processing 64444167 bp from intervals
23:05:44.304 INFO SelectVariants - Done initializing engine
23:05:44.495 INFO ProgressMeter - Starting traversal
23:05:44.496 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),2.764708809999966,Cpu time(s),2.5340146019999557
23:05:51.449 INFO ProgressMeter - chr20:64048842 0.1 98576 850771.0
23:05:51.452 INFO ProgressMeter - Traversal complete. Processed 98576 total variants in 0.1 minutes.
23:05:51.464 INFO SelectVariants - Shutting down engine
[February 3, 2021 at 11:05:51 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.15 minutes.
Runtime.totalMemory()=5372903424
Best,
-
Hi Lucas Taniguti,
How large is your GenomicsDB directory?
I have created a ticket on our github to look into this issue further, you can follow along here: https://github.com/broadinstitute/gatk/issues/7070
Best,
Genevieve
-
Hi Lucas Taniguti,
I followed up again with my team and we were wondering if you could try some steps so that we can diagnose where the problem is occurring. Could you run jstack with the process ID while the command is running and see what is going on? Do it a few different times and post here what you see. We want to know whether the process is hung and not progressing, or whether it is just very slow.
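A minimal sketch of how repeated thread dumps could be collected, assuming the JDK's `jstack` is on the PATH. The function name `dump_threads`, the output file naming, and the default count and delay are illustrative choices, not something prescribed in this thread.

```shell
# Hypothetical helper: take several thread dumps of a running JVM,
# a fixed delay apart. JSTACK may be overridden; it defaults to the JDK's jstack.
dump_threads() {
  pid=$1            # process ID of the running GATK java process
  count=${2:-3}     # number of dumps to take
  delay=${3:-60}    # seconds to wait between dumps
  i=1
  while [ "$i" -le "$count" ]; do
    # One file per dump, e.g. jstack.<pid>.1.txt, jstack.<pid>.2.txt, ...
    "${JSTACK:-jstack}" "$pid" > "jstack.$pid.$i.txt" 2>&1
    if [ "$i" -lt "$count" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
}

# Example use, assuming exactly one matching java process:
#   dump_threads "$(pgrep -f SelectVariants)" 3 60
```

If the dumps show the same threads sitting at the same stack frames each time, that points towards a hang rather than slow progress.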
Thank you,
Genevieve
-
Hi Genevieve-Brandt-she-her, thank you for the help.
Here are the jstack outputs (.tar.gz): https://drive.google.com/file/d/1OCfsGpzYxY6LlPkOI-ECSVrFgfMRE-WJ/view
New note: when I store two samples in the same command (GenomicsDBImport with --genomicsdb-workspace-path), GenotypeGVCFs completes successfully in 14 minutes, but when I store one and later add the other (using --genomicsdb-update-workspace-path), the GenotypeGVCFs process seems to hang (logs in the .tar.gz).
The resulting GenomicsDB workspace is 2.6 MB. For testing purposes I stored only chr20 of two exomes. Locally, GenotypeGVCFs completes in less than one minute.
Note: for this new test I was using a VM with 2 CPUs and 8 GB of memory.
-
Lucas Taniguti just wanted to let you know we are still looking into this and will get back to you when we have updates.
-
Thank you Genevieve-Brandt-she-her , please let me know if you need more details.
-
Hi Lucas Taniguti,
For the new note you gave above where the SelectVariants command only has the issue when you have updated the GenomicsDB, do you see anything like this when running locally?
Best,
Genevieve
-
Hi Lucas Taniguti,
Our team wanted to follow up and see if we could get this problem resolved. Do you have any updates on your end?
Would you be open to trying a debuggable version that we could send to you to run? This could help us to determine what is going wrong.
Also, could you send in your GenomicsDB workspace as a bug report? We can then try it on our end and see if we can figure out the issue.
Thank you,
Genevieve
-
Hi Genevieve-Brandt-she-her , sorry for answering so late. I do not have updates.
This week I'll try to prepare a working example on how to reproduce my issue on your end.
Thank you.
-
Great, thank you!
-
Here is what I'm trying, step by step. Please let me know if more info is required.
Inputs
- Sample 1 g.vcf
- Sample 2 g.vcf
- Human reference genome (gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta)
The g.vcf files used and the databases produced are available in Google Drive:
https://drive.google.com/file/d/1rq5p5YpDYY6n0OqTL05IaDAl8vlMhxNv/view?usp=sharing
Using locally (works)
SAMPLE1=gvcfs/sample-1-chr20.g.vcf.gz
SAMPLE2=gvcfs/sample-2-chr20.g.vcf.gz
DB=my-local-database
GATK=../gatk-4.1.9.0/gatk
$GATK --java-options "-Xmx10g -Xms5g" \
GenomicsDBImport \
--genomicsdb-workspace-path $DB \
-L chr20 \
-V $SAMPLE1
# [March 14, 2021 at 12:59:13 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.07 minutes.
# Runtime.totalMemory()=5372903424
# Tool returned:
# true
$GATK --java-options "-Xmx10g -Xms5g" \
GenomicsDBImport \
--genomicsdb-update-workspace-path $DB \
-L chr20 \
-V $SAMPLE2
# [March 14, 2021 at 12:59:42 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.07 minutes.
# Runtime.totalMemory()=5372903424
# Tool returned:
# true
$GATK --java-options "-Xmx10g -Xms5g" \
SelectVariants \
-R ../Homo_sapiens_assembly38.fasta \
-V gendb://$DB \
-L chr20 -O test.vcf.gz
# [March 14, 2021 at 1:00:20 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.10 minutes.
# Runtime.totalMemory()=5372903424
## Using Google Cloud Storage (still not working)
SAMPLE1=gvcfs/sample-1-chr20.g.vcf.gz
SAMPLE2=gvcfs/sample-2-chr20.g.vcf.gz
DB=genomicsdb-test/my-gcs-database
export GOOGLE_APPLICATION_CREDENTIALS=SA-secret.json
$GATK --java-options "-Xmx10g -Xms5g" \
GenomicsDBImport \
--genomicsdb-workspace-path gs://$DB \
-L chr20 \
-V $SAMPLE1
# [March 14, 2021 at 12:42:51 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.60 minutes.
# Runtime.totalMemory()=5372903424
# Tool returned:
# true
$GATK --java-options "-Xmx10g -Xms5g" \
GenomicsDBImport \
--genomicsdb-update-workspace-path gs://$DB \
-L chr20 \
-V $SAMPLE2
# [March 14, 2021 at 12:44:24 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.55 minutes.
# Runtime.totalMemory()=5372903424
# Tool returned:
# true
$GATK --java-options "-Xmx10g -Xms5g" \
SelectVariants \
-R ../Homo_sapiens_assembly38.fasta \
-V gendb.gs://$DB \
-L chr20 -O test2.vcf.gz
# 12:46:02.138 INFO SelectVariants - Done initializing engine
# 12:46:02.306 INFO ProgressMeter - Starting traversal
# 12:46:02.307 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
# ---and hangs here
Database sizes:
gsutil du -sh gs://genomicsdb-test/my-gcs-database
# 1.31 MiB gs://genomicsdb-test/my-gcs-database
du -sh my-local-database/
# 1.6M my-local-database/
If I copy my-gcs-database to my local filesystem it works:
gsutil cp -r gs://genomicsdb-test/my-gcs-database .
$GATK --java-options "-Xmx10g -Xms5g" SelectVariants \
-R ../Homo_sapiens_assembly38.fasta \
-V gendb://my-gcs-database -L chr20 \
-O test2.vcf.gz
# [March 14, 2021 at 1:02:36 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.10 minutes.
# Runtime.totalMemory()=5372903424
du -sh my-local-database my-gcs-database
# 1.6M my-local-database
# 1.6M my-gcs-database
-
Hi Lucas Taniguti,
Thank you for the update, our team is looking into this.
Genevieve
-
Hi Lucas Taniguti,
The dev team is continuing to work on this issue and will post updates on the github issue ticket. Please stay tuned to this link for updates: https://github.com/broadinstitute/gatk/issues/7070
Thank you,
Genevieve
-
Hello Lucas Taniguti,
I have a solution for the GCS issue. Could you quickly try it out if I send you the native genomicsdb library with instructions for running it with gatk and java-options? To build the library, I will need the OS flavor (Ubuntu or another Linux distribution) and the version you are running.
-
Hello Nalini Ganapati, yes, I can try it out. This is the information about the system I'm running:
~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal
~$ uname -a
Linux phasing-shapeit4-taniguti 5.4.0-1040-gcp #43-Ubuntu SMP Fri Mar 19 17:49:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Thank you.
-
Here is the link to the genomicsdb library. After downloading, here is a sample command for invoking gatk by pointing to the directory of the library -
./gatk-4.1.9.0/gatk --java-options "-Dgenomicsdb.library.path=<library_dir>" SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb.gs://my_bucket/my-gcs-database -L chr20 -O test.vcf.gz
Please let me know if it works or if I have to fine tune the Google Cloud Storage credentials and other GCS settings further.
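Before launching GATK, it may help to confirm that the directory given to `-Dgenomicsdb.library.path` really contains the native library. The sketch below is a hypothetical pre-flight check (the function name is made up); the file name `libtiledbgenomicsdb.so` is the one GATK reports loading on Linux in the logs of this thread.

```shell
# Hypothetical pre-flight check (not part of GATK): succeed only if the
# directory contains the GenomicsDB native library file.
has_genomicsdb_lib() {
  [ -f "$1/libtiledbgenomicsdb.so" ]
}

# Example use:
#   LIB_DIR=/path/to/library_dir
#   has_genomicsdb_lib "$LIB_DIR" || echo "libtiledbgenomicsdb.so not found in $LIB_DIR" >&2
```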
-
Now I get a "fatal error":
$ GOOGLE_APPLICATION_CREDENTIALS=/home/my/secret.json $GATK --java-options "-Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom" SelectVariants -R ../Homo_sapiens_assembly38.fasta -V gendb.gs://$DB -L chr20 -O test.vcf.gz
Using GATK jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar SelectVariants -R ../Homo_sapiens_assembly38.fasta -V gendb.gs://genomicsdb-test/my-gcs-database2 -L chr20 -O test.vcf.gz
17:55:18.447 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:55:18.729 INFO SelectVariants - ------------------------------------------------------------
17:55:18.730 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
17:55:18.730 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
17:55:18.730 INFO SelectVariants - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1040-gcp amd64
17:55:18.731 INFO SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.10+9-Ubuntu-0ubuntu1.20.04
17:55:18.731 INFO SelectVariants - Start Date/Time: March 29, 2021 at 5:55:18 PM UTC
17:55:18.731 INFO SelectVariants - ------------------------------------------------------------
17:55:18.731 INFO SelectVariants - ------------------------------------------------------------
17:55:18.732 INFO SelectVariants - HTSJDK Version: 2.23.0
17:55:18.733 INFO SelectVariants - Picard Version: 2.23.3
17:55:18.733 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:55:18.733 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:55:18.733 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:55:18.733 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:55:18.734 INFO SelectVariants - Deflater: IntelDeflater
17:55:18.734 INFO SelectVariants - Inflater: IntelInflater
17:55:18.734 INFO SelectVariants - GCS max retries/reopens: 20
17:55:18.734 INFO SelectVariants - Requester pays: disabled
17:55:18.735 INFO SelectVariants - Initializing engine
17:55:20.469 INFO GenomicsDBLibLoader - GenomicsDB native library has been loaded from /home/taniguti/genomicsdb/custom/libtiledbgenomicsdb.so
17:55:20.469 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.0-SNAPSHOT-06362e1
[TileDB::StorageManagerConfig] Error: GCS FS only supports already existing buckets. Failed to locate bucket=genomicsdb-test Permanent error in GetBucketMetadata: EasyPerform() - CURL error [77]=Problem with the SSL CA cert (path? access rights?) [UNKNOWN]: Input/output error.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007efc796f15e1, pid=8194, tid=8195
#
# JRE version: OpenJDK Runtime Environment (11.0.10+9) (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C [libtiledbgenomicsdb.so+0x5cf5e1] StorageManagerConfig::~StorageManagerConfig()+0x21
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/taniguti/genomicsdb/core.8194)
#
# An error report file with more information is saved as:
# /home/taniguti/genomicsdb/hs_err_pid8194.log
#
# If you would like to submit a bug report, please visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
The message:
> Failed to locate bucket=genomicsdb-test
suggests that the bucket does not exist, but it does, and my service account has admin roles.
-
Lucas Taniguti, I have placed another library here. This library tries a few well-known CA certificate paths on Linux machines. You could try it if you like; it prints some debugging statements such as `CA Certs path=/etc/pki/tls/certs/ca-bundle.crt`. Please let me know if this fixes your issue.
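The earlier failure was CURL error 77 ("Problem with the SSL CA cert"), so it can be useful to see which CA bundle file exists on the machine. The following is a small diagnostic sketch: the helper name `first_existing` is hypothetical, and the candidate list is an assumption based on commonly used locations across Linux distributions, not the exact list the patched library tries.

```shell
# Print the first existing regular file among the candidate paths given
# as arguments; return non-zero if none exists.
first_existing() {
  for p in "$@"; do
    if [ -f "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  return 1
}

# Assumed candidate list: common CA bundle locations on Linux
# (Debian/Ubuntu, RHEL/Fedora, openSUSE, Alpine).
ca_bundle=$(first_existing \
  /etc/ssl/certs/ca-certificates.crt \
  /etc/pki/tls/certs/ca-bundle.crt \
  /etc/ssl/ca-bundle.pem \
  /etc/ssl/cert.pem) || echo "no CA bundle found in the usual places" >&2
echo "CA bundle: ${ca_bundle:-none}"
```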
-
After downloading the new library, please run your command as before:
GOOGLE_APPLICATION_CREDENTIALS=/home/my/secret.json $GATK --java-options "-Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom" SelectVariants -R ../Homo_sapiens_assembly38.fasta -V gendb.gs://$DB -L chr20 -O test.vcf.gz
-
Nalini Ganapati, it's working now, thank you!
I tested it with the subset I sent earlier and now the "mini-workflow" finishes as expected.
$GATK --java-options "-Xmx10g -Xms5g -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom2" \
GenomicsDBImport \
--genomicsdb-workspace-path gs://$DB \
-L chr20 \
-V $SAMPLE1
# 11:20:50.658 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.3 minutes.
# 11:20:50.658 INFO GenomicsDBImport - Import completed!
# 11:20:50.658 INFO GenomicsDBImport - Shutting down engine
# [April 2, 2021 at 11:20:50 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.36 minutes.
# Runtime.totalMemory()=5372903424
# Tool returned:
# true
$GATK --java-options "-Xmx10g -Xms5g -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom2" \
GenomicsDBImport \
--genomicsdb-update-workspace-path gs://$DB \
-L chr20 \
-V $SAMPLE2
# 11:22:06.549 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.4 minutes.
# 11:22:06.549 INFO GenomicsDBImport - Import completed!
# 11:22:06.549 INFO GenomicsDBImport - Shutting down engine
# [April 2, 2021 at 11:22:06 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.43 minutes.
# Runtime.totalMemory()=5372903424
# Tool returned:
# true
$GATK --java-options "-Xmx10g -Xms5g -Dgenomicsdb.library.path=/home/taniguti/genomicsdb/custom2" \
SelectVariants \
-R ../Homo_sapiens_assembly38.fasta \
-V gendb.gs://$DB \
-L chr20 -O test-final.vcf.gz
# 11:23:08.145 INFO ProgressMeter - chr20:1375618 0.4 1000 2376.4
# GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),1.0103075140000013,Cpu time(s),0.8786914609999955
# 11:23:10.226 INFO ProgressMeter - chr20:64084138 0.5 26128 57365.3
# 11:23:10.226 INFO ProgressMeter - Traversal complete. Processed 26128 total variants in 0.5 minutes.
# 11:23:10.241 INFO SelectVariants - Shutting down engine
# [April 2, 2021 at 11:23:10 AM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.72 minutes.
# Runtime.totalMemory()=5372903424
Now I'll try it with larger datasets.
-
Thanks Lucas Taniguti. Let me know how it performs with larger datasets.
-
OK, Nalini Ganapati. It's not done yet, but I'll post any updates here.
-
Lucas Taniguti, Nalini Ganapati: I'm trying to use GenomicsDBImport on 238 WGS samples sequenced to 30X, on all intervals, on Google Cloud. Is there anything I should keep in mind when implementing this? In terms of memory and disk, what would be the optimal settings to request, based on your experience? Any insights are greatly appreciated. Thank you.
-
Are the vcfs on google cloud? Do you want to create a GenomicsDB workspace on google cloud or locally? Also, just wondering if you could open a new forum post as this deals with GenomicsDBImport and not SelectVariants?