Issue with GenomicsDBImport on very large chromosomes
So I'm having the following issue. I managed to import all of samples into GenomicsDB and it all seemed to work fine. However, when running GenotypeGVCFs on certain chromosomes the engine shuts down at around 535 Mb (some of these chromosomes are ~650 Mb). For chromosomes that are shorter than this it is fine. I've tried splitting the chromosomes in half or giving additional memory but the engine shuts down at the same place along the chromosome, which makes me believe the import into GenomicsDBImport didn't run correctly. Should I split the chromosomes for GenomicsDBImport too? I've provided an example of one of the chromosomes below.
Edit: I've now also tried split the chromosomes using -L going into GenomicsDB and it seems to have no affect, it still fails at the same point along the chromosome. Is there some kind of option I'm missing or would I have to just renamed the end of the chromosome and say its a completely new? I'd have to remake all the bams if I could that route
REQUIRED for all errors and issues:
a) GATK version used: 4.2.0.0
b) Exact command used: gatk --java-options "-Xmx96g" GenotypeGVCFs -R MorexV3.fasta -V gendb://AvcixBulbul -L chr3H:300000001-621516506 -O AcvixBulbul_3H.2.vcf.gz
c) Entire program log: Using GATK jar /local/cluster/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx96g -jar /local/cluster/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar GenotypeGVCFs -R MorexV3.fasta -V gendb://AvcixBulbul -L chr3H:300000001-621516506 -O AcvixBulbul_3H.2.vcf.gz
08:20:32.343 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/local/cluster/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
May 10, 2023 8:20:32 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
08:20:32.496 INFO GenotypeGVCFs - ------------------------------------------------------------
08:20:32.496 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.0.0
08:20:32.496 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
08:20:32.497 INFO GenotypeGVCFs - Executing as clareshaun@hoser.cgrb.oregonstate.local on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
08:20:32.497 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_71-b15
08:20:32.497 INFO GenotypeGVCFs - Start Date/Time: May 10, 2023 8:20:32 AM PDT
08:20:32.497 INFO GenotypeGVCFs - ------------------------------------------------------------
08:20:32.497 INFO GenotypeGVCFs - ------------------------------------------------------------
08:20:32.497 INFO GenotypeGVCFs - HTSJDK Version: 2.24.0
08:20:32.497 INFO GenotypeGVCFs - Picard Version: 2.25.0
08:20:32.497 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
08:20:32.498 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
08:20:32.498 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:20:32.498 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:20:32.498 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:20:32.498 INFO GenotypeGVCFs - Deflater: IntelDeflater
08:20:32.498 INFO GenotypeGVCFs - Inflater: IntelInflater
08:20:32.498 INFO GenotypeGVCFs - GCS max retries/reopens: 20
08:20:32.498 INFO GenotypeGVCFs - Requester pays: disabled
08:20:32.498 INFO GenotypeGVCFs - Initializing engine
08:20:33.197 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
08:20:33.299 info NativeGenomicsDB - pid=93178 tid=93259 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
08:20:33.299 info NativeGenomicsDB - pid=93178 tid=93259 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
08:20:33.299 info NativeGenomicsDB - pid=93178 tid=93259 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
08:20:33.425 INFO IntervalArgumentCollection - Processing 321516506 bp from intervals
08:20:33.429 INFO GenotypeGVCFs - Done initializing engine
08:20:33.503 INFO ProgressMeter - Starting traversal
08:20:33.504 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
08:21:55.222 INFO ProgressMeter - chr3H:302601154 1.4 1000 734.2
08:22:06.277 INFO ProgressMeter - chr3H:312566343 1.5 7000 4527.2
08:22:16.291 INFO ProgressMeter - chr3H:323789709 1.7 13000 7588.5
08:22:27.230 INFO ProgressMeter - chr3H:329345709 1.9 17000 8968.9
08:22:39.894 INFO ProgressMeter - chr3H:336187363 2.1 22000 10443.9
08:22:50.364 INFO ProgressMeter - chr3H:342910962 2.3 26000 11398.5
08:23:00.896 INFO ProgressMeter - chr3H:349074419 2.5 30000 12212.3
08:23:11.839 INFO ProgressMeter - chr3H:354496085 2.6 34000 12884.1
08:23:23.828 INFO ProgressMeter - chr3H:359316934 2.8 38000 13386.3
08:23:37.320 INFO ProgressMeter - chr3H:365543501 3.1 43000 14035.9
08:23:48.165 INFO ProgressMeter - chr3H:372587964 3.2 47000 14486.7
08:23:58.838 INFO ProgressMeter - chr3H:379103213 3.4 53000 15487.0
08:24:09.693 INFO ProgressMeter - chr3H:386442687 3.6 59000 16374.6
08:24:31.601 INFO ProgressMeter - chr3H:389756924 4.0 62000 15623.9
08:24:41.931 INFO ProgressMeter - chr3H:395848746 4.1 68000 16423.3
08:24:53.708 INFO ProgressMeter - chr3H:404324917 4.3 76000 17524.7
08:25:06.030 INFO ProgressMeter - chr3H:408739537 4.5 81000 17833.2
08:25:17.442 INFO ProgressMeter - chr3H:416980209 4.7 88000 18595.6
08:25:28.029 INFO ProgressMeter - chr3H:421171654 4.9 95000 19353.2
08:25:46.231 INFO ProgressMeter - chr3H:427110567 5.2 101000 19377.9
08:25:56.781 INFO ProgressMeter - chr3H:433146636 5.4 109000 20230.3
08:26:08.270 INFO ProgressMeter - chr3H:437670014 5.6 115000 20611.4
08:26:52.378 INFO ProgressMeter - chr3H:441755184 6.3 119000 18845.3
08:27:03.901 INFO ProgressMeter - chr3H:444714982 6.5 125000 19211.2
08:27:14.433 INFO ProgressMeter - chr3H:449099541 6.7 131000 19604.5
08:27:25.290 INFO ProgressMeter - chr3H:453612383 6.9 138000 20107.5
08:27:36.410 INFO ProgressMeter - chr3H:457878406 7.0 147000 20855.7
08:27:47.054 INFO ProgressMeter - chr3H:461735224 7.2 153000 21174.0
08:27:57.219 INFO ProgressMeter - chr3H:464929683 7.4 159000 21500.3
08:28:07.305 INFO ProgressMeter - chr3H:466754806 7.6 165000 21815.7
08:28:18.299 INFO ProgressMeter - chr3H:470859422 7.7 171000 22074.2
08:28:32.753 INFO ProgressMeter - chr3H:471245618 8.0 172000 21533.7
08:28:44.556 INFO ProgressMeter - chr3H:475364845 8.2 180000 21993.6
08:28:55.222 INFO ProgressMeter - chr3H:478354454 8.4 184000 22004.4
08:29:05.575 INFO ProgressMeter - chr3H:481821730 8.5 189000 22145.4
08:29:17.804 INFO ProgressMeter - chr3H:486006519 8.7 197000 22544.3
08:29:28.150 INFO ProgressMeter - chr3H:490316111 8.9 204000 22893.8
08:29:39.864 INFO ProgressMeter - chr3H:493939547 9.1 209000 22951.9
08:29:50.080 INFO ProgressMeter - chr3H:496657095 9.3 215000 23177.4
08:30:00.942 INFO ProgressMeter - chr3H:498796305 9.5 219000 23156.7
08:30:39.223 INFO ProgressMeter - chr3H:500221980 10.1 223000 22089.5
08:30:49.245 INFO ProgressMeter - chr3H:502228806 10.3 229000 22314.6
08:31:00.129 INFO ProgressMeter - chr3H:507128240 10.4 236000 22597.2
08:31:11.152 INFO ProgressMeter - chr3H:509886284 10.6 241000 22677.1
08:31:21.561 INFO ProgressMeter - chr3H:512382022 10.8 247000 22868.4
08:31:32.064 INFO ProgressMeter - chr3H:514552810 11.0 253000 23050.3
08:31:42.226 INFO ProgressMeter - chr3H:517492424 11.1 260000 23328.1
08:31:52.920 INFO ProgressMeter - chr3H:520809247 11.3 268000 23667.4
08:32:03.119 INFO ProgressMeter - chr3H:522664270 11.5 273000 23752.4
08:32:20.170 INFO ProgressMeter - chr3H:523097647 11.8 274000 23264.2
08:32:30.879 INFO ProgressMeter - chr3H:525665242 12.0 280000 23418.7
08:32:40.974 INFO ProgressMeter - chr3H:527884667 12.1 285000 23506.1
08:32:54.177 INFO ProgressMeter - chr3H:530016299 12.3 290000 23492.1
08:33:06.107 INFO ProgressMeter - chr3H:531576329 12.5 296000 23598.1
08:33:16.724 INFO ProgressMeter - chr3H:532237078 12.7 300000 23584.3
08:33:27.767 INFO ProgressMeter - chr3H:533621648 12.9 306000 23712.9
08:33:39.033 INFO ProgressMeter - chr3H:534977681 13.1 311000 23754.7
08:33:49.546 INFO ProgressMeter - chr3H:535958152 13.3 316000 23817.8
08:33:54.988 INFO GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),194.3197662599987,Cpu time(s),139.4770293000004
[May 10, 2023 8:33:55 AM PDT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 13.38 minutes.
Runtime.totalMemory()=1332740096
java.lang.ArrayIndexOutOfBoundsException: 32775
at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.closeTool(GenotypeGVCFs.java:295)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1064)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Full Command: gatk --java-options -Xmx96g GenotypeGVCFs -R MorexV3.fasta -V gendb://AvcixBulbul -L chr3H:300000001-621516506 -O AcvixBulbul_3H.2.vcf.gz
Memory (kb): 6217296
# SWAP (freq): 0
# Waits (freq): 1191480
CPU (percent): 116%
Time (seconds): 806.19
Time (hh:mm:ss.ms): 13:26.19
System CPU Time (seconds): 120.09
User CPU Time (seconds): 819.00
-
Hello,
You've unfortunately hit an annoying problem that's hard to work around. There is a limit on chromosome size when using a .bai index of 2^29-1 (536870911 bp). The index can't handle anything longer than that. There is a CSI index which supports longer ones but we don't have good support for it in gatk. Splitting the chromosome is the right idea, you have to split it at the alignment stage into 2 differently named contigs. You can't simply import half of it at a time into genomicsdb unfortunately. I'm sorry to bring bad news. I know that's a huge hassle since you already have the bams.
Louis
-
I did find this with someone saying they found a work around using unzipped g.vcf but I'm not sure if that's just to get into GenomicsDB? I was working with unzipped g.vcf anyway and it didn't work.
That's unfortunate, it'll be a big pain to adjust all the positions afterwards to to get the real position but thank you for the clarification
-
It's possible there is a workaround using unzip vcf since that uses a different index format so it might have different limits. It's generally so unwieldy to use unzipped gvcfs that I don't have much experience doing so
I'm sorry it's such a pain. Creating a liftover file between the split reference and the actual reference might be a good solution but you'd probably have to use non-gatk/picard tools to handle it. I'd like to fix it but it's non-trivial change and we have a lot of higher priority issues ahead of it.
Please sign in to leave a comment.
3 comments