Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770

3 comments

  • Gökalp Çelik

    Hi Hanan Sela

    According to the SAM specification v1:

    In the BAI format, each bin may span 2^29, 2^26, 2^23, 2^20, 2^17 or 2^14 bp. Bin 0 spans a 512Mbp region, bins
    1–8 span 64Mbp, 9–72 8Mbp, 73–584 1Mbp, 585–4680 128kbp, and bins 4681–37448 span 16kbp regions.
    This implies that this index format does not support reference chromosome sequences longer than 2^29 − 1.
    The CSI format generalises the sizes of the bins, and supports reference sequences of the same length as
    are supported by SAM and BAM.

    This means that you need to create a CSI index for your BAM file and run HaplotypeCaller with the additional parameter:

    --create-output-variant-index false

    This will let HaplotypeCaller run without issues and generate a VCF file without an index. Unfortunately, HTSJDK has no CSI-like index support for VCFs, so using this VCF file in downstream analyses might require additional work that is currently beyond GATK's capabilities.

    There may be future plans to implement such functionality, but I cannot give a definitive answer as to when that might happen.

    I hope this helps. 
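
The limit and workaround described in the reply above can be sketched in Python. This is a minimal illustration, not GATK or HTSJDK code: `reg2bin` follows the binning arithmetic given in the SAM specification, while `contigs_needing_csi`, `build_commands`, and all file names are hypothetical helpers made up for this example. The commands are only assembled as argument lists, using `samtools index -c` to request a CSI index and the HaplotypeCaller parameter quoted above.

```python
from typing import Dict, List

# Longest reference sequence a BAI index can address, per the quoted SAM spec text.
BAI_MAX_LENGTH = 2**29 - 1  # 536,870,911 bp


def reg2bin(beg: int, end: int) -> int:
    """BAI bin for the zero-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # 16 kbp bins, 4681-37448
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # 128 kbp bins, 585-4680
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # 1 Mbp bins, 73-584
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # 8 Mbp bins, 9-72
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # 64 Mbp bins, 1-8
    return 0                       # the single 512 Mbp bin


def contigs_needing_csi(lengths: Dict[str, int]) -> List[str]:
    """Hypothetical helper: names of contigs too long for a BAI index."""
    return [name for name, length in lengths.items() if length > BAI_MAX_LENGTH]


def build_commands(bam: str, reference: str, out_vcf: str) -> List[List[str]]:
    """Hypothetical helper: argument lists for the two steps in the reply."""
    return [
        # Step 1: CSI index for the BAM (the -c flag selects CSI instead of BAI).
        ["samtools", "index", "-c", bam],
        # Step 2: HaplotypeCaller without writing an output variant index.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf,
         "--create-output-variant-index", "false"],
    ]
```

The contig lengths could be taken from the reference's `.fai` file or sequence dictionary. Each argument list can then be passed to `subprocess.run(..., check=True)` once the paths are replaced with real ones.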

     

  • Hanan Sela

    Hi

    In this post it is claimed that indexing with samtools after GVCF generation can help. Is a CSI index generated by SAMtools compatible with downstream applications such as GenomicsDBImport and GenotypeGVCFs?

    Thank you.

  • Gökalp Çelik

    Hi Hanan Sela

    Unfortunately, neither HTSJDK nor any GATK tool is currently compatible with CSI-format VCF indexes. If you wish to perform joint genotyping, glnexus might be an option, but since it is outside of our realm we cannot provide any support for it. Here is the wording from the GitHub page of glnexus:

    glnexus_cli does not use tabix indices for the input gVCFs. If you need to process only a few selected genomic ranges, then it may be advantageous to slice your gVCFs beforehand.

    Since it does not rely on the tabix index (which you cannot have anyway with your genome size), you may be able to genotype the whole gVCF without issues. Of course, YMMV.

    I hope this helps. 
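
The glnexus note quoted above suggests slicing gVCFs beforehand when only a few ranges are needed. Without a usable index, one way to do that is a plain sequential scan. The sketch below is an assumption-laden illustration, not part of glnexus or GATK: `slice_gvcf` is a hypothetical helper that keeps header lines and any record overlapping a 1-based, inclusive region, using the `END` INFO key for reference blocks and the REF length otherwise.

```python
import re
from typing import Iterable, Iterator

# Matches an END=<int> entry in the INFO column without matching keys like BLOCKEND=.
_END_RE = re.compile(r"(?:^|;)END=(\d+)")


def slice_gvcf(lines: Iterable[str], chrom: str, start: int, end: int) -> Iterator[str]:
    """Yield header lines plus records overlapping chrom:start-end (1-based, inclusive)."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the full header so the slice remains a valid VCF
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[0] != chrom:
            continue
        pos = int(fields[1])
        match = _END_RE.search(fields[7])
        # Reference blocks carry END in INFO; other records span the REF allele.
        rec_end = int(match.group(1)) if match else pos + len(fields[3]) - 1
        if pos <= end and rec_end >= start:
            yield line
```

A compressed gVCF could be fed in with `gzip.open(path, "rt")`, since the function only needs an iterable of text lines. The scan reads the whole file once, which is slow for large genomes but does not depend on any index format.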
