Mutect2 - issue with long scaffolds (non-human/mouse)
Hi,
I ran into the error below when running Mutect2 on normal samples to create a PoN. I suspect it is related to the fact that the species I work with has some very long chromosomes: the genome assembly contains scaffolds up to ~720 Mb. I did not have any issue generating GVCFs with HaplotypeCaller on the same BAM files, though.
a) GATK version used
gatk-4.1.7.0 (same error occurred using 4.1.6.0)
b) Exact GATK commands used
gatk Mutect2 -R reference.fasta -I normal1.bam -max-mnp-distance 0 -O normal1.vcf.gz
c) The entire error log if applicable.
14:43:18.677 INFO ProgressMeter - 1:536853327 369.4 2731120 7394.3
14:43:21.422 INFO Mutect2 - Shutting down engine
[May 2, 2020 2:43:21 PM AEST] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 369.42 minutes.
Runtime.totalMemory()=23421517824
java.lang.ArrayIndexOutOfBoundsException: 32770
at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.closeTool(Mutect2.java:305)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1052)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Thank you.
-
Yuanyuan Cheng Out of curiosity, are you studying opossums?
Try writing uncompressed VCF output: "-O normal1.vcf" instead of "-O normal1.vcf.gz". The tabix (.tbi) format that indexes a bgzipped VCF is hard-coded to cover only 2^29 (about 536.9 million) bases per contig, while I believe the .idx format that indexes an uncompressed VCF has no such limitation.
There exists a .csi index format that I believe the GATK can use as input, but here the problem is the index that the GATK generates on the fly, and the GATK has no capacity to emit a .csi index.
See a related discussion here: https://github.com/broadinstitute/gatk/issues/6110.
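To see in advance which contigs would hit the 2^29 limit, something like the following could be run against the reference's .fai index. This is only a minimal sketch: the function and constant names are made up for illustration, and it assumes the standard samtools .fai layout (contig name and length in the first two tab-separated columns). Note that the last ProgressMeter position in the log above, 1:536853327, sits just below 2^29 = 536,870,912, which is consistent with the index failing once the data runs past that point.

```python
# Sketch only: flag contigs that are too long for a tabix (.tbi) index.
# The names below are illustrative, not part of GATK or htsjdk.

TABIX_MAX_POSITION = 2 ** 29  # 536,870,912 bases


def contigs_exceeding_tabix_limit(fai_lines, limit=TABIX_MAX_POSITION):
    """Return (name, length) pairs for contigs longer than `limit`.

    `fai_lines` is any iterable of lines in samtools .fai format, where the
    first two tab-separated columns are the contig name and its length.
    """
    too_long = []
    for line in fai_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        if length > limit:
            too_long.append((name, length))
    return too_long


# Assumed usage, with the index file produced by `samtools faidx reference.fasta`:
#
#     with open("reference.fasta.fai") as handle:
#         for name, length in contigs_exceeding_tabix_limit(handle):
#             print(name, length)
```

Any contig this reports is one where variants past position 2^29 cannot be represented in a .tbi index, so writing uncompressed VCF output is the safer choice for that reference.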
-
Many thanks David! That makes a lot of sense. I am rerunning it now using uncompressed vcf for output format.
I study Tasmanian devils. Lots of marsupials seem to have super long chromosomes, which can cause unexpected trouble sometimes :)