HaplotypeCaller produces very large .idx files
Hi. I have been migrating our workflow our gatk3 workflow to gatk4 (4.1.4.1) and encountered a strange problem with HaplotypeCaller: The generated index file is larger than the generated g.vcf file. If I regenerate the index using IndexFeatureFile, the generated idx file is much smaller.
The reason I care is that the next step (GenomicsDBImport) is a lot faster (and uses less memory) with the small (regenerated) index files (0.3min vs 1.7 min per batch).
I am using the following command:
```
gatk HaplotypeCaller -R /opt/data/dna/testdata/hg19/ucsc_hg19.fa -I AACGAGAATGGAGCTCAA-1.bam -L ~/results/gatk4-MBM292-8/gatk.bed --emit-ref-confidence BP_RESOLUTION -O AACGAGAATGGAGCTCAA.g.vcf
```
This is its output:
```
Using GATK jar /home/anze/tp4/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/anze/tp4/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar HaplotypeCaller -R /opt/data/dna/testdata/hg19/ucsc_hg19.fa -I AACGAGAATGGAGCTCAA-1.bam -L /home/anze/results/gatk4-MBM292-8/gatk.bed --emit-ref-confidence BP_RESOLUTION -O AACGAGAATGGAGCTCAA.g.vcf
20:14:00.473 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/anze/tp4/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
20:14:00.972 INFO HaplotypeCaller - ------------------------------------------------------------
20:14:00.974 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.4.1
20:14:00.975 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
20:14:00.980 INFO HaplotypeCaller - Executing as anze@naringi-instance-rna on Linux v5.0.0-1028-gcp amd64
20:14:00.980 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
20:14:00.980 INFO HaplotypeCaller - Start Date/Time: February 20, 2020 8:14:00 PM UTC
20:14:00.981 INFO HaplotypeCaller - ------------------------------------------------------------
20:14:00.981 INFO HaplotypeCaller - ------------------------------------------------------------
20:14:00.995 INFO HaplotypeCaller - HTSJDK Version: 2.21.0
20:14:00.995 INFO HaplotypeCaller - Picard Version: 2.21.2
20:14:00.995 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
20:14:00.995 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
20:14:00.996 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
20:14:00.996 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
20:14:00.996 INFO HaplotypeCaller - Deflater: IntelDeflater
20:14:00.997 INFO HaplotypeCaller - Inflater: IntelInflater
20:14:00.997 INFO HaplotypeCaller - GCS max retries/reopens: 20
20:14:00.997 INFO HaplotypeCaller - Requester pays: disabled
20:14:00.997 INFO HaplotypeCaller - Initializing engine
20:14:01.650 INFO FeatureManager - Using codec BEDCodec to read file file:///home/anze/results/gatk4-MBM292-8/gatk.bed
20:14:01.671 INFO IntervalArgumentCollection - Processing 21692 bp from intervals
20:14:01.708 INFO HaplotypeCaller - Done initializing engine
20:14:01.711 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
20:14:01.735 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
20:14:01.735 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
20:14:01.794 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/anze/tp4/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
20:14:01.801 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/anze/tp4/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
20:14:01.860 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
20:14:01.862 INFO IntelPairHmm - Available threads: 16
20:14:01.862 INFO IntelPairHmm - Requested threads: 4
20:14:01.862 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
20:14:01.919 INFO ProgressMeter - Starting traversal
20:14:01.920 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
20:14:03.240 WARN InbreedingCoeff - InbreedingCoeff will not be calculated; at least 10 samples must have called genotypes
20:14:12.635 INFO ProgressMeter - chr6:44229670 0.2 80 448.0
20:14:25.251 INFO ProgressMeter - chr14:35871924 0.4 160 411.5
20:14:35.454 INFO ProgressMeter - chr19:9072842 0.6 220 393.6
20:14:41.728 INFO HaplotypeCaller - No reads filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
20:14:41.735 INFO ProgressMeter - chrX:44938526 0.7 268 403.9
20:14:41.735 INFO ProgressMeter - Traversal complete. Processed 268 total regions in 0.7 minutes.
20:14:42.047 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.08534575900000001
20:14:42.048 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 6.938380428
20:14:42.048 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 14.11 sec
20:14:42.052 INFO HaplotypeCaller - Shutting down engine
[February 20, 2020 8:14:42 PM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.69 minutes.
Runtime.totalMemory()=1819279360
```
And these are the file sizes created by Haplotype caller:
```
-rw-rw-r-- 1 anze anze 1.6M Feb 20 20:14 AACGAGAATGGAGCTCAA.g.vcf
-rw-rw-r-- 1 anze anze 7.9M Feb 20 20:14 AACGAGAATGGAGCTCAA.g.vcf.idx
```
And file sizes if I regenerate index using
```
gatk IndexFeatureFile --input AACGAGAATGGAGCTCAA.g.vcf
```
are
```
-rw-rw-r-- 1 anze anze 1.6M Feb 20 20:14 AACGAGAATGGAGCTCAA.g.vcf
-rw-rw-r-- 1 anze anze 128K Feb 20 20:15 AACGAGAATGGAGCTCAA.g.vcf.idx
```
-
The size of the index is determined by the number of bins, and in theory it should be optimizing itself for size, among other things. For what it's worth, 1.6MB can be the size of just the header, basically. I don't believe that the .idx files you have are atypically large, necessarily.
Try this: in your -O argument, add .gz to the extension, which should be compressed in size and should be functional as a Tabix index file.
-
Thanks, I'll try adding .gz to the extension, and see how it compares.
I do not really care about the index size, if the size would not influence the speed of GenomicsDBImport.
I went and looked at the code, there is a special case in IndexFeatureFile which generates a LinearIndex for .g.vcf files and DynamicIndex by default.
https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/IndexFeatureFile.java#L113
The code for HaplotypeCaller does not have this special case, it just passes INDEX_ON_THE_FLY to the createVCFWriter, which creates DynamicIndex by default.
https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java#L358 -
Derek Caetano-Anolles, I have rerun my analysis with modified extension (.g.vcf.gz). The run was as fast as when I used .g.vcf files and regenerated the index files manually. Thanks for your help!
-
Great! I'm glad it works! :)
Please sign in to leave a comment.
4 comments