GenomicsDBImport doesn't work with >2 samples
I am working with 8 GVCF files generated with HaplotypeCaller and indexed using bcftools. ValidateVariants runs without issue.
The command below fails with a "Failed to create reader from file" error.
gatk --java-options "-Xmx40g -Xms40g" GenomicsDBImport --genomicsdb-workspace-path g1out -L chr1 --reader-threads 15 -V 1176.rm.vcf.gz -V 14469.rm.vcf.gz -V 51566.rm.vcf.gz -V 70296.rm.vcf.gz -V 85693.rm.vcf.gz -V 8829.rm.vcf.gz -V 91697.rm.vcf.gz -V 96371.rm.vcf.gz
Using GATK jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx40g -Xms40g -jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4- GenomicsDBImport --genomicsdb-workspace-path g1out -L 20 --reader-threads 15 -V 1176.rm.vcf.gz -V 14469.rm.vcf.gz -V 51566.rm.vcf.gz -V 70296.rm.vcf.gz -V 85693.rm.vcf.gz -V 8829.rm.vcf.gz -V 91697.rm.vcf.gz -V 96371.rm.vcf.gz
11:36:04.355 INFO NativeLibraryLoader - Loading from jar:file:/share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-!/com/intel/gkl/native/
Nov 15, 2022 11:36:04 AM runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
11:36:04.626 INFO GenomicsDBImport - ------------------------------------------------------------
11:36:04.626 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.1
11:36:04.626 INFO GenomicsDBImport - For support and documentation go to
11:36:04.626 INFO GenomicsDBImport - Executing as tr52w@c36b14 on Linux v2.6.32-754.35.1.el6.x86_64 amd64
11:36:04.626 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_77-b03
11:36:04.626 INFO GenomicsDBImport - Start Date/Time: November 15, 2022 11:36:04 AM EST
11:36:04.626 INFO GenomicsDBImport - ------------------------------------------------------------
11:36:04.626 INFO GenomicsDBImport - ------------------------------------------------------------
11:36:04.627 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
11:36:04.627 INFO GenomicsDBImport - Picard Version: 2.22.8
11:36:04.627 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:36:04.627 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:36:04.627 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:36:04.627 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:36:04.627 INFO GenomicsDBImport - Deflater: IntelDeflater
11:36:04.627 INFO GenomicsDBImport - Inflater: IntelInflater
11:36:04.627 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:36:04.627 INFO GenomicsDBImport - Requester pays: disabled
11:36:04.627 INFO GenomicsDBImport - Initializing engine
11:36:04.641 INFO GenomicsDBImport - Shutting down engine
[November 15, 2022 11:36:04 AM EST] done. Elapsed time: 0.01 minutes.
A USER ERROR has occurred: Failed to create reader from file:///project/bam/gatkout/1176.rm.vcf.gz
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Strangely, running with only two samples does not produce the same error:
gatk --java-options "-Xmx40g -Xms40g" GenomicsDBImport --genomicsdb-workspace-path g1out -L chr1 --reader-threads 15 -V 1176.rn.vcf.gz -V 14469.rn.vcf.gz
Using GATK jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx40g -Xms40g -jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4- GenomicsDBImport --genomicsdb-workspace-path g1out -L chr1 --reader-threads 15 -V 1176.rn.vcf.gz -V 14469.rn.vcf.gz
11:45:07.991 INFO NativeLibraryLoader - Loading from jar:file:/share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-!/com/intel/gkl/native/
Nov 15, 2022 11:45:08 AM runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
11:45:08.156 INFO GenomicsDBImport - ------------------------------------------------------------
11:45:08.156 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.1
11:45:08.156 INFO GenomicsDBImport - For support and documentation go to
11:45:08.156 INFO GenomicsDBImport - Executing as tr52w@c36b14 on Linux v2.6.32-754.35.1.el6.x86_64 amd64
11:45:08.156 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_77-b03
11:45:08.156 INFO GenomicsDBImport - Start Date/Time: November 15, 2022 11:45:07 AM EST
11:45:08.156 INFO GenomicsDBImport - ------------------------------------------------------------
11:45:08.156 INFO GenomicsDBImport - ------------------------------------------------------------
11:45:08.157 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
11:45:08.157 INFO GenomicsDBImport - Picard Version: 2.22.8
11:45:08.157 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:45:08.157 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:45:08.157 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:45:08.157 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:45:08.157 INFO GenomicsDBImport - Deflater: IntelDeflater
11:45:08.157 INFO GenomicsDBImport - Inflater: IntelInflater
11:45:08.157 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:45:08.157 INFO GenomicsDBImport - Requester pays: disabled
11:45:08.157 INFO GenomicsDBImport - Initializing engine
11:45:08.491 INFO IntervalArgumentCollection - Processing 248956422 bp from intervals
11:45:08.492 INFO GenomicsDBImport - Done initializing engine
11:45:08.728 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
11:45:08.732 INFO GenomicsDBImport - Vid Map JSON file will be written to /project/bam/gatkout/g1out/vidmap.json
11:45:08.733 INFO GenomicsDBImport - Callset Map JSON file will be written to /project/bam/gatkout/g1out/callset.json
11:45:08.733 INFO GenomicsDBImport - Complete VCF Header will be written to /project/bam/gatkout/g1out/vcfheader.vcf
11:45:08.733 INFO GenomicsDBImport - Importing to workspace - /project/bam/gatkout/g1out
11:45:08.733 INFO ProgressMeter - Starting traversal
11:45:08.733 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
11:45:08.842 INFO GenomicsDBImport - Starting batch input file preload
11:45:08.868 INFO GenomicsDBImport - Finished batch preload
11:45:08.868 INFO GenomicsDBImport - Importing batch 1 with 2 samples
I don't see why sample 1176 would prompt an error only when run with more samples.
Is there a way to run GenotypeGVCFs without this step?
Is there a way to specify all chromosomes for the required -L flag?
Hi Tomás Rodríguez,
Thank you for writing into the GATK forum so that we can help you with this question!
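The good news is that I think the error in your first command comes from a simple typo. You have 1176.rm.vcf.gz instead of 1176.rn.vcf.gz, and the other files in that command use the same .rm suffix where your working two-sample command uses .rn. When you fix that, it should work!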
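A quick pre-flight check can catch this kind of file-name mismatch before GATK starts. The sketch below is a hypothetical helper, not part of GATK: `check_gvcfs` is an invented name, and it assumes each GVCF has a tabix-style `.tbi` index sitting next to it (adjust the suffix if your indexes are named differently).

```shell
#!/usr/bin/env bash
# check_gvcfs: hypothetical helper (not a GATK tool).
# Reports every path that does not exist, or that exists but has no
# "<file>.tbi" index beside it. Returns non-zero if any check fails.
check_gvcfs() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            rc=1
        elif [ ! -f "$f.tbi" ]; then
            echo "missing index: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (file names taken from the thread):
#   check_gvcfs 1176.rn.vcf.gz 14469.rn.vcf.gz && echo "inputs look OK"
```

Running it on the same list of files that will be passed with -V shows every mistyped name at once, rather than only the first one GATK trips over.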
For your question about the intervals, yes, it is possible to specify all the chromosomes. There are multiple ways to do it and they are all explained in this article: Intervals and interval lists. Please let me know if you have any questions about that article and I can help out!
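As one possible illustration, a BED file covering every contig can be derived from the reference's FASTA index. This is a minimal sketch, assuming a `.fai` file produced by `samtools faidx` exists for your reference; `fai_to_bed` and `reference.fa.fai` are hypothetical names used only for this example.

```shell
#!/usr/bin/env bash
# fai_to_bed: hypothetical helper that prints one whole-contig BED interval
# per line of a samtools .fai index.
# .fai columns: name, length, offset, linebases, linewidth.
# BED is 0-based and half-open, so each contig spans 0..length.
fai_to_bed() {
    awk 'BEGIN { OFS = "\t" } { print $1, 0, $2 }' "$1"
}

# Example:
#   fai_to_bed reference.fa.fai > allchr.bed
#   gatk GenomicsDBImport ... -L allchr.bed
```

Note that a reference index lists every contig, including any unplaced or alternate contigs, so you may want to filter the output down to the chromosomes you actually need.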
Thank you for your quick reply!
Embarrassingly, this was the issue. Apologies for that!
I specified the correct sample names and intervals according to the support page you linked.
Unfortunately, I thought that 8 hours would be more than enough time to run this job and it timed out. Is there a way to estimate the amount of time/resources I would need? Should I be concerned that combining 8 VCFs takes this long (maybe I haven't configured the batch option correctly)? Thanks again for the help.
gatk --java-options "-Xmx40g -Xms40g" GenomicsDBImport --genomicsdb-workspace-path g1out -L allchr.bed --reader-threads 15 --sample-name-map g1
Using GATK jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx40g -Xms40g -jar /share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4- GenomicsDBImport --genomicsdb-workspace-path g1out -L allchr.bed --reader-threads 15 --sample-name-map g1
21:42:06.834 INFO NativeLibraryLoader - Loading from jar:file:/share/pkg/conda/2018-05-11/envs/gatk_4.1.8.1/share/gatk4-!/com/intel/gkl/native/
Nov 16, 2022 9:42:06 PM runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
21:42:06.996 INFO GenomicsDBImport - ------------------------------------------------------------
21:42:06.996 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.1
21:42:06.996 INFO GenomicsDBImport - For support and documentation go to
21:42:06.996 INFO GenomicsDBImport - Executing as tr52w@c40b08 on Linux v2.6.32-754.35.1.el6.x86_64 amd64
21:42:06.996 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_77-b03
21:42:06.996 INFO GenomicsDBImport - Start Date/Time: November 16, 2022 9:42:06 PM EST
21:42:06.996 INFO GenomicsDBImport - ------------------------------------------------------------
21:42:06.996 INFO GenomicsDBImport - ------------------------------------------------------------
21:42:06.997 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
21:42:06.997 INFO GenomicsDBImport - Picard Version: 2.22.8
21:42:06.997 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:42:06.997 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:42:06.997 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:42:06.997 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:42:06.997 INFO GenomicsDBImport - Deflater: IntelDeflater
21:42:06.997 INFO GenomicsDBImport - Inflater: IntelInflater
21:42:06.997 INFO GenomicsDBImport - GCS max retries/reopens: 20
21:42:06.997 INFO GenomicsDBImport - Requester pays: disabled
21:42:06.997 INFO GenomicsDBImport - Initializing engine
21:42:07.289 INFO FeatureManager - Using codec BEDCodec to read file file:///project/umw/ATAC_output/bam/gatkout/allchr.bed
21:42:07.294 INFO IntervalArgumentCollection - Processing 3088269808 bp from intervals
21:42:07.296 INFO GenomicsDBImport - Done initializing engine
21:42:07.519 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
21:42:07.523 INFO GenomicsDBImport - Vid Map JSON file will be written to /project/umw/ATAC_output/bam/gatkout/g1out/vidmap.json
21:42:07.523 INFO GenomicsDBImport - Callset Map JSON file will be written to /project/umw/ATAC_output/bam/gatkout/g1out/callset.json
21:42:07.523 INFO GenomicsDBImport - Complete VCF Header will be written to /project/umw/ATAC_output/bam/gatkout/g1out/vcfheader.vcf
21:42:07.524 INFO GenomicsDBImport - Importing to workspace - /project/umw/ATAC_output/bam/gatkout/g1out
21:42:07.524 WARN GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.
21:42:07.524 INFO ProgressMeter - Starting traversal
21:42:07.524 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
21:42:07.757 INFO GenomicsDBImport - Importing batch 1 with 8 samples -
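For readers following along, the file passed to --sample-name-map in the command above is a plain tab-separated list with one line per sample: the sample name, then the path to its GVCF. The sketch below shows one way such a file could be generated. `make_sample_map` is a hypothetical helper, and it assumes the sample name is the file name up to the first dot (for example 1176.rn.vcf.gz gives 1176), which may not match your naming scheme.

```shell
#!/usr/bin/env bash
# make_sample_map: hypothetical helper that prints "sample<TAB>path"
# for each GVCF given as an argument.
# Assumption: sample name = file name up to the first dot.
make_sample_map() {
    local f name
    for f in "$@"; do
        name=$(basename "$f")
        printf '%s\t%s\n' "${name%%.*}" "$f"
    done
}

# Example (writes a map named g1, as in the thread):
#   make_sample_map *.rn.vcf.gz > g1
```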
Great to hear that it's working! And yes, GenomicsDBImport can be a time-consuming step. We put together an article with all of our performance recommendations here: GenomicsDBImport usage and performance guidelines. Let me know if you have any questions about those!
Hi Genevieve,
Thanks again for your replies. Unfortunately, my job could not complete after 36 hours. I'm new to using GATK and am getting the sense that I'm not using the workflow for its intended purpose. Was this designed to process very small intervals on single-replicate VCFs only? The extensive runtime and excessive memory usage seem like a red flag that I'm doing something wrong. I'd like to detect SNVs genome-wide across multiple replicates.
Was the job paused at 36 hours? Or was it in the middle of the run? You can share your program log and I can take a look to see if there were any issues. If it never started (like the program log you shared) there is a chance you do not have all the GenomicsDBImport requirements installed correctly. GenomicsDBImport is definitely meant for many VCFs and we have a lot of parameters to optimize GenomicsDBImport for your usage. Here are some options I recommend:
- If you want to run the import all at once, you can import your intervals separately to create multiple GenomicsDB workspaces, one per interval. After genotyping each workspace, the resulting VCFs can be combined with MergeVcfs.
- If you want to run the import serially, you can import a few samples at a time, using the option --genomicsdb-update-workspace-path.
- If you are running on a shared filesystem, I recommend setting the argument --genomicsdb-shared-posixfs-optimizations to true.
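The first and third options above could be combined roughly as follows. This is only a sketch under stated assumptions: `print_import_commands` is a hypothetical helper, the `gdb_` workspace prefix is invented for illustration, and the function only prints the commands (a dry run) so they can be reviewed or submitted as separate cluster jobs. The GATK flags themselves are the ones already used or named in this thread.

```shell
#!/usr/bin/env bash
# print_import_commands: hypothetical dry-run helper.
# Usage: print_import_commands <sample-map> <contig> [<contig> ...]
# Prints one GenomicsDBImport command per contig, each writing to its own
# workspace (gdb_<contig>, an invented naming scheme).
print_import_commands() {
    local map=$1 contig
    shift
    for contig in "$@"; do
        echo gatk GenomicsDBImport \
            --genomicsdb-workspace-path "gdb_${contig}" \
            -L "$contig" \
            --sample-name-map "$map" \
            --genomicsdb-shared-posixfs-optimizations true
    done
}

# Example:
#   print_import_commands g1 chr1 chr2 chr3
```

Each resulting workspace can then be genotyped on its own, for example with `gatk GenotypeGVCFs -R <reference> -V gendb://gdb_chr1 -O chr1.vcf.gz` (the reference path is a placeholder), and the per-contig VCFs merged afterwards with MergeVcfs. Because every command covers a single interval, the jobs are independent and can run in parallel.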
Hi Tomás,
We haven't heard from you in a while so we're going to close out this ticket in our system. If you still require assistance, simply respond to this thread and we'll be happy to pick up where we left off!
Kind regards,