Cannot create an index file for my GVCF
Hi, I am new to bioinformatics and to GATK. I am analyzing data from 125 samples for a GWAS study on Theobroma cacao. I got as far as variant calling: I ran HaplotypeCaller on all my BAM files, aggregated the resulting GVCFs with GenomicsDBImport, and finally ran GenotypeGVCFs on that output. Everything ran without errors until I tried to count the variants in my final file, "matina.combined.g.vcf.gz". The following error appeared:
a) GATK version used: 4.6.0.0
b) Exact command used:
gatk CountVariants -V Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz
Using GATK jar /gatk/gatk-package-4.6.0.0-local.jar
c) Entire program log:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.6.0.0-local.jar CountVariants -V Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz
14:05:41.702 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:05:41.907 INFO CountVariants - ------------------------------------------------------------
14:05:41.912 INFO CountVariants - The Genome Analysis Toolkit (GATK) v4.6.0.0
14:05:41.912 INFO CountVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
14:05:41.912 INFO CountVariants - Executing as root@81ae01d4dd9e on Linux v5.14.0-362.24.1.el9_3.x86_64 amd64
14:05:41.913 INFO CountVariants - Java runtime: OpenJDK 64-Bit Server VM v17.0.9+9-Ubuntu-122.04
14:05:41.913 INFO CountVariants - Start Date/Time: September 25, 2024 at 2:05:41 PM GMT
14:05:41.913 INFO CountVariants - ------------------------------------------------------------
14:05:41.913 INFO CountVariants - ------------------------------------------------------------
14:05:41.916 INFO CountVariants - HTSJDK Version: 4.1.1
14:05:41.916 INFO CountVariants - Picard Version: 3.2.0
14:05:41.917 INFO CountVariants - Built for Spark Version: 3.5.0
14:05:41.918 INFO CountVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:05:41.918 INFO CountVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:05:41.919 INFO CountVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:05:41.919 INFO CountVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:05:41.920 INFO CountVariants - Deflater: IntelDeflater
14:05:41.920 INFO CountVariants - Inflater: IntelInflater
14:05:41.921 INFO CountVariants - GCS max retries/reopens: 20
14:05:41.921 INFO CountVariants - Requester pays: disabled
14:05:41.922 INFO CountVariants - Initializing engine
14:05:42.077 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/cacao_GWAS/Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz
14:05:42.101 INFO CountVariants - Shutting down engine
[September 25, 2024 at 2:05:42 PM GMT] org.broadinstitute.hellbender.tools.walkers.CountVariants done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=201326592
***********************************************************************
A USER ERROR has occurred: An index is required but was not found for file drivingVariantFile:/gatk/cacao_GWAS/Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz. Support for unindexed block-compressed files has been temporarily disabled. Try running IndexFeatureFile on the input.
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
I tried to create the index file using this command:
gatk IndexFeatureFile -I Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz -O Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.genotypes.tbi
and the following error occurred:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.6.0.0-local.jar IndexFeatureFile -I Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz -O Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.genotypes.tbi
14:06:50.542 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:06:50.739 INFO IndexFeatureFile - ------------------------------------------------------------
14:06:50.743 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.6.0.0
14:06:50.743 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
14:06:50.743 INFO IndexFeatureFile - Executing as root@81ae01d4dd9e on Linux v5.14.0-362.24.1.el9_3.x86_64 amd64
14:06:50.743 INFO IndexFeatureFile - Java runtime: OpenJDK 64-Bit Server VM v17.0.9+9-Ubuntu-122.04
14:06:50.744 INFO IndexFeatureFile - Start Date/Time: September 25, 2024 at 2:06:50 PM GMT
14:06:50.744 INFO IndexFeatureFile - ------------------------------------------------------------
14:06:50.744 INFO IndexFeatureFile - ------------------------------------------------------------
14:06:50.745 INFO IndexFeatureFile - HTSJDK Version: 4.1.1
14:06:50.746 INFO IndexFeatureFile - Picard Version: 3.2.0
14:06:50.746 INFO IndexFeatureFile - Built for Spark Version: 3.5.0
14:06:50.747 INFO IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:06:50.747 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:06:50.747 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:06:50.747 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:06:50.748 INFO IndexFeatureFile - Deflater: IntelDeflater
14:06:50.748 INFO IndexFeatureFile - Inflater: IntelInflater
14:06:50.749 INFO IndexFeatureFile - GCS max retries/reopens: 20
14:06:50.749 INFO IndexFeatureFile - Requester pays: disabled
14:06:50.749 INFO IndexFeatureFile - Initializing engine
14:06:50.749 INFO IndexFeatureFile - Done initializing engine
14:06:50.883 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/cacao_GWAS/Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz
14:06:50.907 INFO ProgressMeter - Starting traversal
14:06:50.908 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
14:07:00.924 INFO ProgressMeter - ENA|CM001879|CM001879.1:1059428 0.2 1031000 6182908.5
14:07:10.916 INFO ProgressMeter - ENA|CM001879|CM001879.1:2284491 0.3 2214000 6640008.0
14:07:20.924 INFO ProgressMeter - ENA|CM001879|CM001879.1:3505090 0.5 3384000 6765068.5
14:07:30.925 INFO ProgressMeter - ENA|CM001879|CM001879.1:4745146 0.7 4556000 6831608.9
14:07:40.929 INFO ProgressMeter - ENA|CM001879|CM001879.1:5948582 0.8 5685000 6819545.0
14:07:50.930 INFO ProgressMeter - ENA|CM001879|CM001879.1:7131316 1.0 6798000 6795734.8
14:08:00.935 INFO ProgressMeter - ENA|CM001879|CM001879.1:8182033 1.2 7791000 6675711.2
14:08:10.936 INFO ProgressMeter - ENA|CM001879|CM001879.1:9309647 1.3 8857000 6640591.8
14:08:20.937 INFO ProgressMeter - ENA|CM001879|CM001879.1:10472175 1.5 9916000 6608610.7
14:08:30.937 INFO ProgressMeter - ENA|CM001879|CM001879.1:11613689 1.7 10961000 6574759.1
14:08:40.940 INFO ProgressMeter - ENA|CM001879|CM001879.1:12717618 1.8 12003000 6545365.3
14:08:50.948 INFO ProgressMeter - ENA|CM001879|CM001879.1:13789005 2.0 12984000 6489998.9
14:09:00.971 INFO ProgressMeter - ENA|CM001879|CM001879.1:15038347 2.2 14068000 6489877.8
14:09:10.972 INFO ProgressMeter - ENA|CM001879|CM001879.1:16338375 2.3 15242000 6529347.5
14:09:20.976 INFO ProgressMeter - ENA|CM001879|CM001879.1:17482705 2.5 16297000 6515933.0
14:09:30.990 INFO ProgressMeter - ENA|CM001879|CM001879.1:18647247 2.7 17354000 6504538.4
14:09:40.994 INFO ProgressMeter - ENA|CM001879|CM001879.1:19806256 2.8 18428000 6500787.8
14:09:50.996 INFO ProgressMeter - ENA|CM001879|CM001879.1:20949560 3.0 19443000 6477941.0
14:10:00.999 INFO ProgressMeter - ENA|CM001879|CM001879.1:22079290 3.2 20468000 6460552.7
14:10:11.004 INFO ProgressMeter - ENA|CM001879|CM001879.1:23274952 3.3 21551000 6462230.4
14:10:21.011 INFO ProgressMeter - ENA|CM001879|CM001879.1:24480003 3.5 22645000 6466889.7
14:10:31.014 INFO ProgressMeter - ENA|CM001879|CM001879.1:25664581 3.7 23700000 6460611.6
14:10:41.016 INFO ProgressMeter - ENA|CM001879|CM001879.1:26937443 3.8 24869000 6484576.7
14:10:51.015 INFO ProgressMeter - ENA|CM001879|CM001879.1:28159203 4.0 25981000 6492382.5
14:11:01.022 INFO ProgressMeter - ENA|CM001879|CM001879.1:29301419 4.2 27040000 6486719.9
14:11:11.023 INFO ProgressMeter - ENA|CM001879|CM001879.1:30440921 4.3 28051000 6470495.5
14:11:21.023 INFO ProgressMeter - ENA|CM001879|CM001879.1:31621453 4.5 29153000 6475710.3
14:11:31.024 INFO ProgressMeter - ENA|CM001879|CM001879.1:32729388 4.7 30204000 6469628.5
14:11:41.030 INFO ProgressMeter - ENA|CM001879|CM001879.1:33740967 4.8 31158000 6443838.6
14:11:51.037 INFO ProgressMeter - ENA|CM001879|CM001879.1:34537125 5.0 31903000 6377942.5
14:12:01.037 INFO ProgressMeter - ENA|CM001879|CM001879.1:35771983 5.2 33079000 6399765.3
14:12:11.040 INFO ProgressMeter - ENA|CM001879|CM001879.1:37037626 5.3 34275000 6423952.8
14:12:21.046 INFO ProgressMeter - ENA|CM001879|CM001879.1:38176000 5.5 35363000 6427007.1
14:12:31.054 INFO ProgressMeter - ENA|CM001880|CM001880.1:429349 5.7 36538000 6445152.6
14:12:41.059 INFO ProgressMeter - ENA|CM001880|CM001880.1:1576154 5.8 37628000 6447787.8
14:12:51.060 INFO ProgressMeter - ENA|CM001880|CM001880.1:2812119 6.0 38813000 6466121.2
14:13:01.063 INFO ProgressMeter - ENA|CM001880|CM001880.1:4028042 6.2 39933000 6472962.5
14:13:11.066 INFO ProgressMeter - ENA|CM001880|CM001880.1:5245685 6.3 41101000 6486951.4
14:13:21.068 INFO ProgressMeter - ENA|CM001880|CM001880.1:6517307 6.5 42307000 6506133.4
14:13:31.074 INFO ProgressMeter - ENA|CM001880|CM001880.1:7765589 6.7 43483000 6519760.6
14:13:41.082 INFO ProgressMeter - ENA|CM001880|CM001880.1:8969231 6.8 44622000 6527294.6
14:13:51.086 INFO ProgressMeter - ENA|CM001880|CM001880.1:10218618 7.0 45708000 6526994.7
14:14:01.086 INFO ProgressMeter - ENA|CM001880|CM001880.1:11413645 7.2 46808000 6528676.6
14:14:11.088 INFO ProgressMeter - ENA|CM001880|CM001880.1:12626571 7.3 47952000 6536264.9
14:14:21.097 INFO ProgressMeter - ENA|CM001880|CM001880.1:13796790 7.5 49038000 6535669.5
14:14:31.103 INFO ProgressMeter - ENA|CM001880|CM001880.1:15012795 7.7 50166000 6540632.9
14:14:41.124 INFO ProgressMeter - ENA|CM001880|CM001880.1:16261438 7.8 51288000 6544410.5
14:14:51.131 INFO ProgressMeter - ENA|CM001880|CM001880.1:17466291 8.0 52420000 6549484.5
14:15:01.135 INFO ProgressMeter - ENA|CM001880|CM001880.1:18644136 8.2 53535000 6552297.4
14:15:11.141 INFO ProgressMeter - ENA|CM001880|CM001880.1:19874920 8.3 54626000 6552079.8
14:15:21.149 INFO ProgressMeter - ENA|CM001880|CM001880.1:21116908 8.5 55718000 6551975.5
14:15:31.155 INFO ProgressMeter - ENA|CM001880|CM001880.1:22229786 8.7 56733000 6543032.6
14:15:41.158 INFO ProgressMeter - ENA|CM001880|CM001880.1:23429151 8.8 57780000 6538060.4
14:15:51.162 INFO ProgressMeter - ENA|CM001880|CM001880.1:24561027 9.0 58812000 6531618.6
14:16:01.168 INFO ProgressMeter - ENA|CM001880|CM001880.1:25856404 9.2 59914000 6533008.2
14:16:11.176 INFO ProgressMeter - ENA|CM001880|CM001880.1:27049050 9.3 61030000 6535824.1
14:16:21.175 INFO ProgressMeter - ENA|CM001880|CM001880.1:28365570 9.5 62265000 6551153.3
14:16:24.880 INFO IndexFeatureFile - Shutting down engine
[September 25, 2024 at 2:16:24 PM GMT] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 9.57 minutes.
Runtime.totalMemory()=1484783616
htsjdk.samtools.FileTruncatedException: Premature end of file: /gatk/cacao_GWAS/Downloads/GWAS_cacao_data/usftp21.novogene.com/Data_analysis/7.variant_calling/Haplotype_caller/Matina/matina.combined.g.vcf.gz
at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:541)
at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:479)
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:469)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:207)
at htsjdk.samtools.util.BlockCompressedInputStream.readLine(BlockCompressedInputStream.java:317)
at htsjdk.tribble.readers.BlockCompressedAsciiLineReader.readLine(BlockCompressedAsciiLineReader.java:25)
at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:86)
at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:75)
at htsjdk.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
at htsjdk.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:48)
at htsjdk.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:26)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:43)
at org.broadinstitute.hellbender.utils.codecs.ProgressReportingDelegatingCodec.decodeLoc(ProgressReportingDelegatingCodec.java:46)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:689)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:650)
at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:511)
at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:476)
at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:502)
at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:403)
at org.broadinstitute.hellbender.tools.IndexFeatureFile.createAppropriateIndexInMemory(IndexFeatureFile.java:109)
at org.broadinstitute.hellbender.tools.IndexFeatureFile.doWork(IndexFeatureFile.java:75)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
at org.broadinstitute.hellbender.Main.main(Main.java:306)
I might be missing something, and I appreciate any help from the community.
-
htsjdk.samtools.FileTruncatedException: Premature end of file:
This error message indicates that the file you are trying to index contains corrupt gzip block information and therefore cannot be indexed. Can you check whether the file can be decompressed?
If not, you may need to find an uncorrupted copy of the file.
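One way to run that check is sketched below in Python. It assumes the file was written as BGZF (block-compressed gzip), which normally ends with a fixed 28-byte empty block as an end-of-file marker. The function names `has_bgzf_eof` and `decompresses_cleanly` are made up for this illustration and are not part of GATK or HTSJDK. A missing marker or a failed decompression both suggest the file was cut short, for example by an interrupted write or copy.

```python
import gzip
import os
import sys
import zlib

# Standard 28-byte empty BGZF block that block-compressed files end with.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF


def decompresses_cleanly(path, chunk_size=1 << 20):
    """Stream-decompress the whole file; False means truncated or corrupt."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False


if __name__ == "__main__":
    # Usage: python check_bgzf.py file1.g.vcf.gz [file2.g.vcf.gz ...]
    for path in sys.argv[1:]:
        print(path, "eof_marker:", has_bgzf_eof(path),
              "decompresses:", decompresses_cleanly(path))
```

The marker check is instant, while the full decompression reads the whole file and can take several minutes on a large GVCF.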
Regards.
-
Hi Gökalp Çelik,
Following your answer, I first checked the BAM files for errors using the Picard ValidateSamFile command. No BAM file reported an error.
I then used the gatk ValidateVariants command to check all the g.vcf.gz files. A couple of them did not pass, so I deleted them and rebuilt the database with GenomicsDBImport.
Then I tried to run the gatk GenotypeGVCFs command, but it stops suddenly without any error or message. It just returns to the prompt.
Here is the command I am using. It may be worth mentioning that I am running GATK from the Docker container.
gatk --java-options "-Xms15G -Xmx15G -XX:ParallelGCThreads=2" GenotypeGVCFs -R ../../../5.Ref_genome/Matina/GCA_000403535-chromosomes.fasta -V gendb://matina.genomicsDB -O matina.genotypes.vcf.gz -OVI true
At first I thought the problem could be excessive RAM use: even though I specified a maximum of 15 GB (I have 50 GB available), the system showed that all 50 GB were in use. I therefore reduced the settings to -Xms2G -Xmx2G, but the problem remained the same.
The run starts correctly, with RAM use increasing progressively until all of it is in use. I repeated this many times, and it always stops after about 8-10 hours. There are 10 chromosomes, and it usually stops at the 4th.
Sorry if this changes the whole problem I started with.
-
What is the ploidy of your samples?
-
Hi Gökalp Çelik,
It is diploid. I am working with Theobroma cacao.
-
Hi again.
Can you try running it on single chromosomes by providing the -L parameter, to see whether a particular chromosome is causing the issue?
There could be a memory leak; however, that is not expected, as our tests with human data always use sensible amounts of memory and complete the whole task. One thing we do is run GenotypeGVCFs in shards split by hardmasked regions.
Let us know how it goes.
Regards.
-
Hi Gökalp Çelik,
I tried what you suggested and it worked. I ran GenotypeGVCFs on each chromosome separately, and the process ran smoothly. I then used the GatherVcfs command to create a single VCF file and continued with filtering the variants.
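For anyone following the same route, the per-chromosome approach can be sketched roughly as below. This is only an illustration, not the exact commands that were run: the helper names and the `.partNN.vcf.gz` output naming are made up, and only the two contig names that appear in the log above are listed (the remaining eight would need to be added in reference order). The script just builds and prints the command lines; because the contig names contain `|`, they have to be quoted when passed to a shell, which `shlex.join` takes care of.

```python
import shlex


def genotype_commands(reference, workspace, contigs, prefix):
    """Build one GenotypeGVCFs command per contig, restricted with -L.

    Returns the commands and the per-contig output paths, both in the
    order the contigs were given.
    """
    commands, outputs = [], []
    for index, contig in enumerate(contigs, start=1):
        out = f"{prefix}.part{index:02d}.vcf.gz"  # hypothetical naming scheme
        commands.append([
            "gatk", "GenotypeGVCFs",
            "-R", reference,
            "-V", f"gendb://{workspace}",
            "-L", contig,
            "-O", out,
        ])
        outputs.append(out)
    return commands, outputs


def gather_command(inputs, output):
    """Build a GatherVcfs command; inputs must stay in reference order."""
    command = ["gatk", "GatherVcfs"]
    for path in inputs:
        command += ["-I", path]
    return command + ["-O", output]


if __name__ == "__main__":
    # Only the contigs visible in the log above; extend with the rest.
    contigs = ["ENA|CM001879|CM001879.1", "ENA|CM001880|CM001880.1"]
    commands, outputs = genotype_commands(
        "../../../5.Ref_genome/Matina/GCA_000403535-chromosomes.fasta",
        "matina.genomicsDB",
        contigs,
        "matina.genotypes",
    )
    for command in commands + [gather_command(outputs, "matina.genotypes.vcf.gz")]:
        print(shlex.join(command))
```

Running each chromosome as its own process also means that a failure on one chromosome does not throw away the hours already spent on the others.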
Thank you for solving this problem. I greatly appreciate it.
JB