How to compress VCF for IndexFeatureFile to avoid MalformedFeatureFile error
Hi. I have a gzip-compressed VCF file, 1060142146_S31.vcf.gz, produced by HaplotypeCaller. IndexFeatureFile works fine on this file as written by HaplotypeCaller. But after I decompressed the same file with gunzip and recompressed it with gzip, IndexFeatureFile exits with an error when I try to index the newly compressed 1060142146_S31.vcf.gz. How should I decompress and recompress VCF files so that indexing does not fail?
GATK version:
The Genome Analysis Toolkit (GATK) v4.6.1.0
HTSJDK Version: 4.1.3
Picard Version: 3.3.0
Gzip version: 1.12
Gunzip version: 1.12
OS version: Ubuntu 24.04.1 LTS
Exact command used:
gatk IndexFeatureFile --input 1060142146_S31.vcf.gz
Entire program log:
Using GATK jar /home/administrator/tools/gatk/gatk-package-4.6.1.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/administrator/tools/gatk/gatk-package-4.6.1.0-local.jar IndexFeatureFile --input 1060142146_S31.vcf.gz
13:28:23.495 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/administrator/tools/gatk/gatk-package-4.6.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
SLF4J(W): Class path contains multiple SLF4J providers.
SLF4J(W): Found provider [org.apache.logging.slf4j.SLF4JServiceProvider@34279b8a]
SLF4J(W): Found provider [ch.qos.logback.classic.spi.LogbackServiceProvider@687389a6]
SLF4J(W): See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J(I): Actual provider is of type [org.apache.logging.slf4j.SLF4JServiceProvider@34279b8a]
13:28:23.727 INFO IndexFeatureFile - ------------------------------------------------------------
13:28:23.730 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.6.1.0
13:28:23.731 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
13:28:23.731 INFO IndexFeatureFile - Executing as administrator@compute-vm-8-32-500-hdd-1736852799422 on Linux v6.8.0-51-generic amd64
13:28:23.731 INFO IndexFeatureFile - Java runtime: OpenJDK 64-Bit Server VM v17.0.14+7-Ubuntu-124.04
13:28:23.731 INFO IndexFeatureFile - Start Date/Time: February 19, 2025 at 1:28:23 PM ALMT
13:28:23.731 INFO IndexFeatureFile - ------------------------------------------------------------
13:28:23.731 INFO IndexFeatureFile - ------------------------------------------------------------
13:28:23.732 INFO IndexFeatureFile - HTSJDK Version: 4.1.3
13:28:23.732 INFO IndexFeatureFile - Picard Version: 3.3.0
13:28:23.733 INFO IndexFeatureFile - Built for Spark Version: 3.5.0
13:28:23.735 INFO IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:28:23.735 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:28:23.736 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:28:23.736 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:28:23.736 INFO IndexFeatureFile - Deflater: IntelDeflater
13:28:23.736 INFO IndexFeatureFile - Inflater: IntelInflater
13:28:23.737 INFO IndexFeatureFile - GCS max retries/reopens: 20
13:28:23.737 INFO IndexFeatureFile - Requester pays: disabled
13:28:23.737 INFO IndexFeatureFile - Initializing engine
13:28:23.737 INFO IndexFeatureFile - Done initializing engine
13:28:23.829 INFO FeatureManager - Using codec VCFCodec to read file file:///home/administrator/varcall_results/gatk/1060142146_S31.vcf.gz
13:28:23.833 INFO IndexFeatureFile - Shutting down engine
[February 19, 2025 at 1:28:23 PM ALMT] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=201326592
***********************************************************************
A USER ERROR has occurred: Error while trying to create index for 1060142146_S31.vcf.gz. Error was: htsjdk.tribble.TribbleException.MalformedFeatureFile: Input file is not in valid block compressed format., for input source: /home/administrator/varcall_results/gatk/1060142146_S31.vcf.gz
***********************************************************************
-
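Our tools, and many other HTS-related tools, do not use plain gzip. Instead they use a block-compressed variant of gzip (the BGZF format), which is what the bgzip tool produces. HaplotypeCaller writes its compressed output in this format, which is why indexing works on the original file; a file recompressed with plain gzip is not block compressed, hence the error. You should be able to recompress the file with the bgzip tool, which comes with samtools or bcftools and can also be found in the repositories of many mainstream Linux distributions.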
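As a rough sketch of the above (not an official GATK recipe), the shell snippet below checks whether a .gz file starts with a BGZF block header and, if it does not, recompresses it with bgzip. The function names is_bgzf and rebgzip_vcf are made up for illustration. The header check assumes the BGZF "BC" extra subfield comes first in the gzip header, which is the layout bgzip writes, and it only inspects the first block of the file.

```shell
# Return 0 if the file begins with a BGZF block header, 1 otherwise.
# Looks at the first 16 bytes only: gzip magic (1f 8b), deflate (08),
# FEXTRA flag (04), then the "BC" extra subfield (42 43) with length 2.
is_bgzf() {
    bgzf_header=$(od -A n -t x1 -N 16 "$1" | tr -d ' \n')
    case "$bgzf_header" in
        1f8b0804????????????????42430200) return 0 ;;
        *) return 1 ;;
    esac
}

# Decompress a plain-gzip VCF and recompress it with bgzip.
# $1 is a path ending in .gz; gunzip leaves the file without .gz,
# and bgzip then writes it back under the original .gz name.
rebgzip_vcf() {
    gunzip "$1" && bgzip "${1%.gz}"
}

# Example usage (requires bgzip and gatk on the PATH):
#   is_bgzf 1060142146_S31.vcf.gz || rebgzip_vcf 1060142146_S31.vcf.gz
#   gatk IndexFeatureFile --input 1060142146_S31.vcf.gz
```

If the check reports that the file is already BGZF, nothing needs to be recompressed and IndexFeatureFile can be run on it directly.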
Regards.
-
Thank you!