GATK 4.0.3.0 GenotypeGVCFs - Could not open array genomicsdb_array
a) GATK version used: GATK 4.0.3.0
b) Exact command used:
[Tool]: GenomicsDBImport
export TILEDB_DISABLE_FILE_LOCKING=1
time ${dir_tool_gatk}/gatk --java-options "-Xmx85g -Xms85g" GenomicsDBImport \
-R ${dir_refdata}/b37_human_g1k_v37_decoy.fasta \
--sample-name-map ${dir_CombineGVCFs}/S2_cohort.sample_map \
--genomicsdb-workspace-path ${dir_CombineGVCFs}/temporary/tmp4 \
--TMP_DIR ${dir_CombineGVCFs}/temporary \
--intervals ${dir_CombineGVCFs}/intervals/bed3_tmp.intervals \
--reader-threads 5 \
--batch-size 50
[output]:
# folders and files in
# --genomicsdb-workspace-path ${dir_CombineGVCFs}/temporary/tmp4
callset.json
genomicsdb_array
__tiledb_workspace.tdb
vcfheader.vcf
vidmap.json
[Tool]: GenotypeGVCFs
export TILEDB_DISABLE_FILE_LOCKING=1
time ${dir_tool_gatk}/gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R ${dir_refdata}/b37_human_g1k_v37_decoy.fasta \
-V gendb://${dir_GenomicsDBImport}/tmp4 \
-O ${dir_GenotypeVCFs}/tmp4.vcf.gz
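Before running GenotypeGVCFs, it can help to confirm that the path given after gendb:// really is the workspace GenomicsDBImport wrote. The sketch below only checks for the five entries listed in the import output above; the function name check_workspace is made up for illustration, and passing this check does not guarantee that GenomicsDB can open the array.

```shell
#!/bin/sh
# Hypothetical helper: verify that a GenomicsDB workspace directory contains
# the entries listed in the GenomicsDBImport output of this post.
check_workspace() {
    ws=$1
    missing=0
    for name in callset.json vidmap.json vcfheader.vcf \
                __tiledb_workspace.tdb genomicsdb_array; do
        if [ ! -e "$ws/$name" ]; then
            echo "missing: $ws/$name"
            missing=1
        fi
    done
    # Return 0 only when every expected entry was found.
    return "$missing"
}
```

For example, check_workspace "${dir_GenomicsDBImport}/tmp4" could be run right before the GenotypeGVCFs command, using the same path that follows gendb://.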
c) Entire error log:
Using GATK jar /home/projects/bin/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false
-Dsamjdk.use_async_io_write_samtools=true
-Dsamjdk.use_async_io_write_tribble=false
-Dsamjdk.compression_level=2
-Xmx4g -jar /home/projects/bin/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar
GenotypeGVCFs -R /home/reference_hg19/b37_human_g1k_v37_decoy.fasta
-V gendb:///home/WES-VCFQC/S2_GenomicsDBImport/temporary/tmp4
-O /home/WES-VCFQC/S2_GenomicsDBImport/VCF/tmp4.vcf.gz
12:52:15.187 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/projects/bin/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:52:16.266 INFO GenotypeGVCFs - ------------------------------------------------------------
12:52:16.267 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.3.0
12:52:16.267 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
12:52:16.267 INFO GenotypeGVCFs - Executing as XX@XX on Linux v2.6.32-754.14.2.el6.x86_64 amd64
12:52:16.267 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_91-b14
12:52:16.267 INFO GenotypeGVCFs - Start Date/Time: August 23, 2021 12:52:14 PM SGT
12:52:16.268 INFO GenotypeGVCFs - ------------------------------------------------------------
12:52:16.268 INFO GenotypeGVCFs - ------------------------------------------------------------
12:52:16.268 INFO GenotypeGVCFs - HTSJDK Version: 2.14.3
12:52:16.269 INFO GenotypeGVCFs - Picard Version: 2.17.2
12:52:16.269 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:52:16.269 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:52:16.269 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:52:16.269 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:52:16.269 INFO GenotypeGVCFs - Deflater: IntelDeflater
12:52:16.269 INFO GenotypeGVCFs - Inflater: IntelInflater
12:52:16.269 INFO GenotypeGVCFs - GCS max retries/reopens: 20
12:52:16.269 INFO GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
12:52:16.269 INFO GenotypeGVCFs - Initializing engine
terminate called after throwing an instance of 'VariantQueryProcessorException'
what(): VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /home/WES-VCFQC/S2_GenomicsDBImport/temporary/tmp4
Hi, I used GenomicsDBImport to combine 2000 GVCFs. To speed things up, I split the BED file and concatenated multiple intervals into a contig. I also ran into the file locking problem, which can be solved by setting TILEDB_DISABLE_FILE_LOCKING=1 on my Linux system. Currently, I am having an issue with GenotypeGVCFs in GATK version 4.0.3.0: it cannot open "genomicsdb_array" although the genomicsdb_array directory does exist. Someone else has reported this issue here: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2018-04-11-2017-12-02/11184-Could-not-open-array-genomicsdbarray-at-workspace-from-GenotypeGVCFs-in-GATK-4000 , but apart from using the latest version of GATK, there seem to be no other solutions.
How can I fix this issue with GATK 4.0.3.0? Does anyone have a better solution?
I also tried GenotypeGVCFs in GATK 4.2.1.0, but it fails on the MQ calculation with the error below, so I think it is better to stick to the same GATK version throughout the whole workflow.
A USER ERROR has occurred: Bad input: Presence of '-RAW_MQ' annotation is detected.
This GATK version expects key RAW_MQandDP with a tuple of sum of squared MQ values and total reads over variant genotypes as the value.
This could indicate that the provided input was produced with an older version of GATK.
Use the argument '--allow-old-rms-mapping-quality-annotation-data' to override and attempt the deprecated MQ calculation.
There may be differences in how newer GATK versions calculate DP and MQ that may result in worse MQ results. Use at your own risk.
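One way to see which MQ annotation a set of GVCFs declares is to look at their headers. The sketch below prints the RAW_MQ and/or RAW_MQandDP keys found in the INFO header lines of a VCF; the function name declared_mq_keys is made up for illustration, and treating the header as a proxy for what the records carry is an assumption, since the error above is about the annotation itself.

```shell
#!/bin/sh
# Hypothetical helper: print which raw MQ INFO keys a (g)VCF header declares.
# Only the "##" header lines are read, so large files are not scanned in full.
declared_mq_keys() {
    vcf=$1
    case "$vcf" in
        *.gz) gzip -cd "$vcf" ;;   # bgzip output is gzip-compatible
        *)    cat "$vcf" ;;
    esac | awk '
        !/^##/ { exit }                      # stop at the #CHROM line
        /^##INFO=<ID=RAW_MQ(andDP)?,/ {
            sub(/^##INFO=<ID=/, "")          # strip everything before the key
            sub(/,.*/, "")                   # strip everything after the key
            print
        }
    '
}
```

A GVCF that prints only RAW_MQ was most likely written by an older GATK version, which is the situation the error message describes.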
Another question is related to the fasta file:
I downloaded the reference data from https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37 . By the time I noticed that this is an old database, I had already generated the GVCF files. It seems that GenotypeGVCFs does not understand the FAI index file.
# error information
[E::fai_read] Could not understand FAI /home/users/nus/bizszl/scratch/WES-new/reference_hg19/b37_human_g1k_v37_decoy.fasta.fai line 1
[E::fai_load3] Failed to read FASTA index /home/users/nus/bizszl/scratch/WES-new/reference_hg19/b37_human_g1k_v37_decoy.fasta.fai
# FAI file
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 249250621 52 60 61
2 dna:chromosome chromosome:GRCh37:2:1:243199373:1 243199373 253404903 60 61
3 dna:chromosome chromosome:GRCh37:3:1:198022430:1 198022430 500657651 60 61
4 dna:chromosome chromosome:GRCh37:4:1:191154276:1 191154276 701980507 60 61
5 dna:chromosome chromosome:GRCh37:5:1:180915260:1 180915260 896320740 60 61
If I use the latest fasta data provided by the GATK team https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0;tab=objects?prefix=&forceOnObjectsSortingFiltering=false , GenotypeGVCFs works without any issues. How can I make GenotypeGVCFs understand the "old" fasta data?
# new fasta index file
1 249250621 52 80 81
2 243199373 252366358 80 81
3 198022430 498605776 80 81
4 191154276 699103539 80 81
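The two listings differ in shape: a FASTA .fai line has five columns (name, length, offset, bases per line, bytes per line), and the old index carries the full header description in the name column. A minimal sketch of a check for this is shown below; the function name check_fai is made up for illustration, and the guess that the extra words in the name are what the parser rejects is an assumption based on the listings.

```shell
#!/bin/sh
# Hypothetical helper: flag .fai lines that do not split into exactly five
# whitespace-separated fields (NAME LENGTH OFFSET LINEBASES LINEWIDTH).
check_fai() {
    awk '
        BEGIN { bad = 0 }
        NF != 5 {
            printf "line %d: expected 5 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Run on the old index, this reports every line (each has seven fields because of the description text); run on the new index, it reports nothing and returns 0.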
Thank you for your help!
-
Hi HT,
The error regarding opening the array file is expected with the older GATK version you are using. Additionally, you cannot combine different GATK versions because of changes to the MQ annotation, which is why you are receiving the RAW_MQ error. Would it be possible for you to update your GATK version and start over with GenomicsDBImport? If that is not possible, the GATK team could try to look into the problem to debug your specific issue.
In regards to the FAI index error, there should be a way for you to still use the old fasta data. Could you try deleting this file and re-indexing the fasta file using samtools? Please let me know if this does not answer your question.
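The re-indexing suggested here could look roughly like the sketch below, assuming samtools is installed and on the PATH. The function name reindex_fasta is made up for illustration, and the old index is renamed to .fai.bak rather than deleted so it can be restored if needed.

```shell
#!/bin/sh
# Hypothetical helper: set the existing .fai aside and rebuild it with
# "samtools faidx", which writes <fasta>.fai next to the FASTA file.
reindex_fasta() {
    fasta=$1
    if [ -f "$fasta.fai" ]; then
        # Keep a backup instead of deleting the old index outright.
        mv "$fasta.fai" "$fasta.fai.bak" || return 1
    fi
    samtools faidx "$fasta"
}
```

For the reference in this thread, the call would be reindex_fasta "${dir_refdata}/b37_human_g1k_v37_decoy.fasta", followed by re-running GenotypeGVCFs.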
Kind regards,
Pamela
-
Hi Pamela,
Thanks for your quick reply!
I re-indexed the fasta file using samtools as you suggested. It now works with GenotypeGVCFs in GATK version 4.2.1.0 using the old fasta data.
I was wondering which plan is best. Here are 3 scenarios. Which of them avoids MQ annotation problems caused by different GATK versions?
- start over from the BaseRecalibrator and ApplyBQSR step using the new GATK version, i.e. update the whole workflow.
- start over from the HaplotypeCaller step and re-call the gVCFs using the new GATK version.
- start over from GenomicsDBImport using the new GATK version, keeping the results of the earlier steps from GATK 4.0.3.0.
Previously, we applied a workflow with GATK 4.0.3.0 to ~1000 WES samples, but the joint calling step used CombineGVCFs. I want to continue using this GATK version so that the two batches of samples can be combined properly. If none of the scenarios above works and I still want to use GATK version 4.0.3.0, could the GATK team kindly help debug this specific issue?
Thank you so much for any help you can provide!!
All the Best, HT.
-
Hi HT,
I'm glad to hear that reindexing the file was successful! In regards to your workflow, any mixing of GATK versions would be prone to errors arising from changes in annotations, calculations, and algorithms. I would suggest using the newer GATK version from the earliest possible step in your workflow (i.e. scenario 1). This would minimize the possibility of errors. If this is not feasible for you or if you would like to continue using the older version, I can submit a GitHub ticket for the GATK team to look into how you can do this.
Kind regards,
Pamela
-
Hi Pamela,
Understood, I should always use the same version of GATK. Thank you for your suggestions! They are really helpful!
Could you please help submit a GitHub ticket? I would indeed like to use GATK 4.0.3.0, as we have already generated 1K gVCF files with this version. It would save a lot of time if GenomicsDBImport and GenotypeGVCFs also worked with this older GATK version. I hope this post and the ticket can help other users as well.
Thank you again for your kind help Pamela!
All the Best, HT
-
Hi HT,
The GATK team has been working on your issue and I received the following update today:
"Our strong recommendation would be to upgrade the pipeline to a modern version of GATK, if at all possible. 4.0.3.0 is many years out of date at this point, and I'm not sure we'll be able to diagnose issues with the GenomicsDB version in use at that time. The user should also consider the many improvements and bug fixes to the HaplotypeCaller that have gone in since that version."
It seems like there may be too many issues that you will run into using version 4.0.3.0 and it is very difficult for the team to pinpoint your initial error given how outdated the version is. Would it be possible for you to update your workflow and run the initial WES samples using the most recent GATK version? This would likely save you a lot of trouble downstream when using tools that have had significant bug fixes since version 4.0.3.0.
Kind regards,
Pamela
-
Hi Pamela,
OK, I see. The latest workflow is indeed a better choice.
Thank you again for your kind help!
Best, HT
-
Hi HT,
I'm glad I could help and I apologize that there was not a better solution for continuing with your existing workflow. I hope that updating the workflow is successful and gives you better results.
Best, Pamela