GenomicsDB Import issues
Dear support team at GATK,
I would like to test GenomicsDB to store VCFs obtained from Gencove (imputation software). I have a directory containing the compressed VCFs, the indexed tbi files (compressed and uncompressed) and the Human genome (.fasta, .fai and .dict). The human genome was obtained from the GATK reference bundle at: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0;tab=objects?prefix=&forceOnObjectsSortingFiltering=false .
-rw-rw-r-- 1 server server 15K dic 6 2019 Homo_sapiens_assembly19.dict
-rw-rw-r-- 1 server server 3,0G dic 6 2019 Homo_sapiens_assembly19.fasta
-rw-rw-r-- 1 server server 2,8K dic 6 2019 Homo_sapiens_assembly19.fasta.fai
-rw-rw-r-- 1 server server 7,3G jul 15 12:21 patient.vcf
-rw-rw-r-- 1 server server 1,1G jul 14 14:31 patient.vcf.gz
-rw-rw-r-- 1 server server 2,1M jul 14 14:32 patient.vcf.gz.tbi
-rw-rw-r-- 1 server server 2,0M jul 15 12:13 patient.vcf.gz.tbi.gz
I am trying to make GenomicsDB Import work, however I was unsuccessful so far. I am getting the following warning:
15:26:46.611 WARN GenomeLocParser - The available sequence dictionary does not contain a sequence length for contig (1). Skipping validation of the genome loc end coordinate (400000).
I have checked that the sequences in my VCF and the Human Genome .dict /.fasta are the same. Here is a head of the fasta file:
head Homo_sapiens_assembly19.dict
@HD VN:1.0 SO:unsorted
@SQ SN:1 LN:249250621 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta AS:GRCh37 M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens
@SQ SN:2 LN:243199373 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta AS:GRCh37 M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=07/07/2022 - 14:39:56
##source=Gencove SaaS v2.0.0 (v20220706t173443z)
##generatedby=Gencove
##reference=human/b37_g1k/v1_0_0
##imputationpanel=human/b37_g1k_full/v1_0_2
##gencovesampleid=fce4a01f-96ac-4454-9580-72baaf1df677
##contig=<ID=1>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=BUF,Number=A,Type=Integer,Description="Is it a variant site falling within buffer regions? (0=no/1=yes)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Unphased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=10
##FPLOIDY=2
##FILTER=<ID=LOWCONF,Description="Set if not true: MAX(FORMAT/GP)>0.9">
##FORMAT=<ID=RC,Number=1,Type=Integer,Description="Count of reads with REF allele">
##FORMAT=<ID=AC,Number=1,Type=Integer,Description="Count of reads with ALT allele">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##bcftools_annotateVersion=1.8+htslib-1.8
##bcftools_annotateCommand=annotate -a gls.0.0.vcf.gz -c FORMAT/RC,FORMAT/AC,FORMAT/DP -Oz -o glimpse.0.0.vcf.gz unannotated.0.0.vcf.gz; Date=Thu Jul 7 14:39:58 2022
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LUSE03081974
1 10177 rs367896724 A AC . LOWCONF RAF=0.425319;AF=0.4455;BUF=0 GT:DS:GP 0/1:0.891:0.247,0.615,0.138
1 10235 rs540431307 T TA . PASS RAF=0.00119808;AF=0;BUF=0 GT:DS:GP 0/0:0:1,0,0
1 10352 rs555500075 T TA . LOWCONF RAF=0.4375;AF=0.4805;BUF=0 GT:DS:GP 0/1:0.961:0.257,0.523,0.219
1 10505 rs548419688 A T . PASS RAF=0.000199681;AF=0;BUF=0 GT:DS:GP:RC:AC:DP 0/0:0:1,0,0:2:0:2
REQUIRED for all errors and issues:
a) GATK version used:
docker run -v /data/genomicsdb/results/LUSE03081974/results:/home/files -it broadinstitute/gatk:latest
Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
b) Exact command used:
gatk GenomicsDBImport -V /home/files/patient.vcf.gz --genomicsdb-workspace-path my_database --intervals 1:100-400000 --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'
c) Entire program log:
Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /gatk/gatk-package-4.2.6.1-local.jar GenomicsDBImport -V /home/files/patient.vcf.gz --genomicsdb-workspace-path my_database --intervals 1:100-400000
15:26:46.120 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:26:46.236 INFO GenomicsDBImport - ------------------------------------------------------------
15:26:46.236 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1
15:26:46.236 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
15:26:46.237 INFO GenomicsDBImport - Executing as root@55936083761f on Linux v5.4.0-109-generic amd64
15:26:46.237 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
15:26:46.237 INFO GenomicsDBImport - Start Date/Time: July 15, 2022 3:26:46 PM GMT
15:26:46.237 INFO GenomicsDBImport - ------------------------------------------------------------
15:26:46.237 INFO GenomicsDBImport - ------------------------------------------------------------
15:26:46.238 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
15:26:46.238 INFO GenomicsDBImport - Picard Version: 2.27.1
15:26:46.238 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
15:26:46.238 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:26:46.238 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:26:46.238 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:26:46.238 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:26:46.238 INFO GenomicsDBImport - Deflater: IntelDeflater
15:26:46.238 INFO GenomicsDBImport - Inflater: IntelInflater
15:26:46.238 INFO GenomicsDBImport - GCS max retries/reopens: 20
15:26:46.238 INFO GenomicsDBImport - Requester pays: disabled
15:26:46.238 INFO GenomicsDBImport - Initializing engine
15:26:46.611 WARN GenomeLocParser - The available sequence dictionary does not contain a sequence length for contig (1). Skipping validation of the genome loc end coordinate (400000).
15:26:46.612 INFO IntervalArgumentCollection - Processing 399901 bp from intervals
15:26:46.613 INFO GenomicsDBImport - Done initializing engine
15:26:46.915 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
15:26:46.915 INFO GenomicsDBImport - Vid Map JSON file will be written to /gatk/my_database/vidmap.json
15:26:46.916 INFO GenomicsDBImport - Callset Map JSON file will be written to /gatk/my_database/callset.json
15:26:46.916 INFO GenomicsDBImport - Complete VCF Header will be written to /gatk/my_database/vcfheader.vcf
15:26:46.916 INFO GenomicsDBImport - Importing to workspace - /gatk/my_database
15:26:47.087 INFO GenomicsDBImport - Importing batch 1 with 1 samples
terminate called after throwing an instance of 'GenomicsDBConfigException'
what(): GenomicsDBConfigException : Position 100 queried for contig 1 which is of length 0; queried position is past end of contig
I don't know how to solve this or where the problem might be. I appreciate your suggestions. Thank you!
-
Hi aheritas,
Thank you for writing to the GATK forum! We hope that we can help you sort this out.
To start, please use the GATK tool, UpdateVCFSequenceDictionary, to re-header your vcf. You must provide a dictionary and the vcf you want to update; it will copy the dictionary into your vcf header. Ensure you run with --replace argument to ensure the original vcf gets overwritten.
Please let us know when you’ve done the above and have rerun GenomicsDB.
Best,
Anthony
Please sign in to leave a comment.
1 comment