Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

java.lang.ArrayIndexOutOfBoundsException: 32772 while running GenotypeGVCFs

Answered

18 comments

  • Pamela Bretscher

    Hi Alon Ziv,

    Here is a forum post about a similar issue that might have a helpful workaround for you. This error is most likely occurring due to an error or mismatch in your gvcf files. I would also suggest running ValidateVariants on the files to pinpoint the problem. 

    Kind regards,

    Pamela

  • Alon Ziv

    Thanks for the quick reply! I will try running ValidateVariants.

    I saw the forum post you sent, but I couldn't make it work right, so I'll try again.

    Thank you!

    Alon

  • Alon Ziv

    Hi Pamela Bretscher,

    I've tried to run ValidateVariants as follows:

    gatk ValidateVariants -R Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa -V MA1.g.vcf 

    and received this error message:

    Using GATK jar /home/alonzi/miniconda3/envs/rna-seq/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/alonzi/miniconda3/envs/rna-seq/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar ValidateVariants -R Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa -V MA1.g.vcf
    14:54:31.346 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/alonzi/miniconda3/envs/rna-seq/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jul 08, 2021 2:54:31 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    14:54:31.464 INFO ValidateVariants - ------------------------------------------------------------
    14:54:31.465 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.2.0.0
    14:54:31.465 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    14:54:31.465 INFO ValidateVariants - Executing as alonzi@khalil1 on Linux v4.19.0-17-amd64 amd64
    14:54:31.465 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_282-b08
    14:54:31.465 INFO ValidateVariants - Start Date/Time: July 8, 2021 2:54:31 PM IDT
    14:54:31.465 INFO ValidateVariants - ------------------------------------------------------------
    14:54:31.465 INFO ValidateVariants - ------------------------------------------------------------
    14:54:31.465 INFO ValidateVariants - HTSJDK Version: 2.24.0
    14:54:31.465 INFO ValidateVariants - Picard Version: 2.25.0
    14:54:31.465 INFO ValidateVariants - Built for Spark Version: 2.4.5
    14:54:31.465 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    14:54:31.465 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    14:54:31.465 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    14:54:31.466 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    14:54:31.466 INFO ValidateVariants - Deflater: IntelDeflater
    14:54:31.466 INFO ValidateVariants - Inflater: IntelInflater
    14:54:31.466 INFO ValidateVariants - GCS max retries/reopens: 20
    14:54:31.466 INFO ValidateVariants - Requester pays: disabled
    14:54:31.466 INFO ValidateVariants - Initializing engine
    14:54:31.717 INFO FeatureManager - Using codec VCFCodec to read file file:///media/alonzi/DATA/Alon/MA1.g.vcf
    14:54:31.896 INFO ValidateVariants - Done initializing engine
    14:54:31.896 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
    14:54:31.896 WARN ValidateVariants - Other possible validations will still be performed
    14:54:31.896 INFO ProgressMeter - Starting traversal
    14:54:31.896 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    14:54:32.049 INFO ValidateVariants - Shutting down engine
    [July 8, 2021 2:54:32 PM IDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.01 minutes.
    Runtime.totalMemory()=964689920
    ***********************************************************************

    A USER ERROR has occurred: Input MA1.g.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position 1A:3456221 are not observed at all in the sample genotypes

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace

    I'm assuming something is wrong with my GVCF files. Are there any suggestions about what I should do now?

    Thanks in advance,

    Alon

  • Pamela Bretscher

    Hi Alon Ziv,

    I was able to find a few past forum posts with the same error message, and it seems that you most likely don't need to worry about this error. You can try the --warn-on-errors argument when running ValidateVariants so warnings will be emitted on these errors rather than terminating the job. 

    https://gatk.broadinstitute.org/hc/en-us/community/posts/360067695771-GenotypeGvcfs-has-formatting-issues-in-both-v4-1-6-0-as-v4-1-7-0

    Kind regards,

    Pamela

  • Pamela Bretscher

    Hi Alon Ziv,

    I apologize, the issue that I referenced in my previous response has actually been resolved already in an earlier version of GATK. If you are still seeing this error, could you please share the site that is causing the issue so we can look further into it?

    Kind regards,

    Pamela

  • Alon Ziv

    Pamela Bretscher, regarding your last message: should I try running it with the --warn-on-errors argument and, if it still does not work, send you the site that is causing the issue? Also, how do I share the site that is causing the issue?

    Thanks again,

    Alon

  • Pamela Bretscher

    Hi Alon Ziv,

    No, you do not need to try --warn-on-errors because the issue this addresses has already been solved. Could you share the portion of the VCF file that is causing the error message when you run ValidateVariants?

    Kind regards,

    Pamela

  • Alon Ziv

    Do you mean this picture?

    The first error I get is at 1A:3456221, which is the first 'blue' rectangle, indicating a SNP in some of my samples.

  • Pamela Bretscher

    Hi Alon Ziv,

    I showed your original stack trace from your GenotypeGVCFs error to some of the GATK developers, and it is possible that the error is occurring due to a mismatch between your reference file and your VCF file headers. Could you verify that the headers/contig lengths in your VCF files match your reference file (Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa)? If everything is compatible, then I can submit this as a GitHub issue for our developers to investigate further.

    Kind regards,

    Pamela

  • Alon Ziv

    Hi Pamela Bretscher,

    These are my reference headers and contig lengths:

    >1A dna:chromosome chromosome:WEWSeq_v.1.0:1A:1:593586810:1 REF
    >1B dna:chromosome chromosome:WEWSeq_v.1.0:1B:1:690537804:1 REF
    >2A dna:chromosome chromosome:WEWSeq_v.1.0:2A:1:775183943:1 REF
    >2B dna:chromosome chromosome:WEWSeq_v.1.0:2B:1:803365466:1 REF
    >3A dna:chromosome chromosome:WEWSeq_v.1.0:3A:1:754274518:1 REF
    >3B dna:chromosome chromosome:WEWSeq_v.1.0:3B:1:841096276:1 REF
    >4A dna:chromosome chromosome:WEWSeq_v.1.0:4A:1:726427787:1 REF
    >4B dna:chromosome chromosome:WEWSeq_v.1.0:4B:1:673896466:1 REF
    >5A dna:chromosome chromosome:WEWSeq_v.1.0:5A:1:700855599:1 REF
    >5B dna:chromosome chromosome:WEWSeq_v.1.0:5B:1:712180895:1 REF
    >6A dna:chromosome chromosome:WEWSeq_v.1.0:6A:1:621432051:1 REF
    >6B dna:chromosome chromosome:WEWSeq_v.1.0:6B:1:703217322:1 REF
    >7A dna:chromosome chromosome:WEWSeq_v.1.0:7A:1:727576108:1 REF
    >7B dna:chromosome chromosome:WEWSeq_v.1.0:7B:1:755408349:1 REF

    And here is an example from one of my VCF files (I checked all of them):

    test of Alt vs. Ref read position bias">
    ##contig=<ID=1A,length=593586810>
    ##contig=<ID=1B,length=690537804>
    ##contig=<ID=2A,length=775183943>
    ##contig=<ID=2B,length=803365466>
    ##contig=<ID=3A,length=754274518>
    ##contig=<ID=3B,length=841096276>
    ##contig=<ID=4A,length=726427787>
    ##contig=<ID=4B,length=673896466>
    ##contig=<ID=5A,length=700855599>
    ##contig=<ID=5B,length=712180895>
    ##contig=<ID=6A,length=621432051>
    ##contig=<ID=6B,length=703217322>
    ##contig=<ID=7A,length=727576108>
    ##contig=<ID=7B,length=755408349>
    ##source=HaplotypeCaller
    ##bcftools_viewVersion=1.10.2+htslib-1.10.2
    ##bcftools_viewCommand=view --header-only 1.g.vcf; Date=Tue Jul 13 09:35:11 2021
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT F4_1

    I don't see any differences. Is there something I'm missing here?
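
The same comparison can be scripted rather than checked by eye. The following is a minimal sketch, not a GATK feature. It assumes Ensembl-style FASTA headers like the ones quoted in this post (contig length as the fifth colon-separated field of the third word) and `##contig` lines of the form shown. The file names `ref_headers.txt`, `vcf_header.txt`, `ref_contigs.tsv`, `vcf_contigs.tsv` and `contig_diff.txt` are hypothetical, and only two contigs are included to keep it short.

```shell
# Toy inputs; on real data these could come from
#   grep '^>' reference.fa            and
#   bcftools view --header-only sample.g.vcf
cat > ref_headers.txt <<'EOF'
>1A dna:chromosome chromosome:WEWSeq_v.1.0:1A:1:593586810:1 REF
>1B dna:chromosome chromosome:WEWSeq_v.1.0:1B:1:690537804:1 REF
EOF
cat > vcf_header.txt <<'EOF'
##contig=<ID=1A,length=593586810>
##contig=<ID=1B,length=690537804>
##source=HaplotypeCaller
EOF

# Reference: name is the first word without ">", length is field 5 of word 3.
awk '/^>/ { split($3, f, ":"); print substr($1, 2) "\t" f[5] }' \
    ref_headers.txt > ref_contigs.tsv

# VCF header: split on "=", "," and ">" so ID and length are fields 3 and 5.
awk -F'[=,>]' '/^##contig=<ID=/ { print $3 "\t" $5 }' \
    vcf_header.txt > vcf_contigs.tsv

# An empty diff means names, lengths and order all agree.
if diff ref_contigs.tsv vcf_contigs.tsv > contig_diff.txt; then
    echo "contigs match"
else
    echo "contigs differ:"
    cat contig_diff.txt
fi
```

Because `diff` is order-sensitive, a reordered contig list would also be reported as a difference.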

  • Pamela Bretscher

    Hi Alon Ziv,

    From what I can tell, it looks like everything matches up, so I created a GitHub ticket so the issue can be investigated further. I'm going to see if there is a workaround you could use in the meantime and will follow up.

    https://github.com/broadinstitute/gatk/issues/7348 

    Kind regards,

    Pamela

  • Alon Ziv

    Thanks, Pamela Bretscher.

    I hope we will find a way to solve this or work around it.

  • Pamela Bretscher

    Hi Alon Ziv,

    Could you please post the lines of your MA1.g.vcf file that include the 1A:3456221 position causing the ValidateVariants error?

    Thanks,

    Pamela

  • Alon Ziv

    Pamela Bretscher, do you mean this?

    1A	3456210	.	G	<NON_REF>	.	.	END=3456210	GT:DP:GQ:MIN_DP:PL	0/0:21:51:21:0,51,765
    1A 3456211 . A <NON_REF> . . END=3456212 GT:DP:GQ:MIN_DP:PL 0/0:21:45:21:0,45,675
    1A 3456213 . T <NON_REF> . . END=3456214 GT:DP:GQ:MIN_DP:PL 0/0:20:42:20:0,42,630
    1A 3456215 . G <NON_REF> . . END=3456217 GT:DP:GQ:MIN_DP:PL 0/0:18:39:18:0,39,585
    1A 3456218 . T <NON_REF> . . END=3456220 GT:DP:GQ:MIN_DP:PL 0/0:17:27:17:0,27,405
    1A 3456221 . C G,<NON_REF> 592.06 . DP=16;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=57600,16 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,14,0:14:42:0|1:3456221_C_G:606,42,0,606,42,606:3456221:0,0,6,8
    1A 3456222 . G <NON_REF> . . END=3456223 GT:DP:GQ:MIN_DP:PL 0/0:14:27:14:0,27,405
    1A 3456224 . A <NON_REF> . . END=3456224 GT:DP:GQ:MIN_DP:PL 0/0:14:24:14:0,24,360
    1A 3456225 . G <NON_REF> . . END=3456226 GT:DP:GQ:MIN_DP:PL 0/0:14:18:13:0,18,270
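
One thing visible in the 1A:3456221 record is that it lists two ALT alleles (G and <NON_REF>) while the genotype 1|1 only uses the first. That is normal for a GVCF. Whether it is what the strict ValidateVariants check objects to is an assumption, not something confirmed in this thread. The sketch below only lists ALT alleles that no genotype refers to. It assumes a single-sample, tab-separated record with GT as the first FORMAT field, and the file names `record.vcf` and `unobserved_alts.txt` are hypothetical.

```shell
# Trimmed copy of the 1A:3456221 record (tab-separated, single sample).
printf '1A\t3456221\t.\tC\tG,<NON_REF>\t592.06\t.\tDP=16\tGT:AD:DP:GQ\t1|1:0,14,0:14:42\n' > record.vcf

# For each record, print every ALT allele whose index never appears in GT.
awk -F'\t' '
!/^#/ {
    n_alt = split($5, alt, ",")            # ALT alleles are numbered 1..n_alt
    split($10, sample, ":")                # GT is the first FORMAT field
    n_gt = split(sample[1], gt, "[/|]")    # allele indices used by the genotype
    split("", seen)                        # reset between records
    for (i = 1; i <= n_gt; i++) seen[gt[i]] = 1
    for (i = 1; i <= n_alt; i++)
        if (!(i in seen)) print $1 ":" $2 "\t" alt[i]
}' record.vcf > unobserved_alts.txt

cat unobserved_alts.txt
```

The ValidateVariants documentation describes a `--validate-GVCF` (`-gvcf`) mode intended for GVCF input. It was not tried in this thread, so its effect on this particular file is unknown.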

  • Alon Ziv

    Hi Pamela Bretscher, I think I might have managed to solve my problem.

    I created a BED file directly from the reference genome FASTA using

    grep "^>" Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa > file.bed

    and then just edited each line to look like this:

    1A 1 593586810
    1B 1 690537804
    2A 1 775183943
    2B 1 803365466
    3A 1 754274518
    3B 1 841096276
    4A 1 726427787
    4B 1 673896466
    5A 1 700855599
    5B 1 712180895
    6A 1 621432051
    6B 1 703217322
    7A 1 727576108
    7B 1 755408349

    I then ran GenomicsDBImport, using the BED file for intervals:

    gatk GenomicsDBImport -V 1.g.vcf -V 2.g.vcf -V 3.g.vcf -V 4.g.vcf -V 5.g.vcf -V 6.g.vcf -V 7.g.vcf -V 8.g.vcf -V 9.g.vcf --genomicsdb-workspace-path my_database1AB -L file.bed

    and then ran the GenotypeGVCFs command, and it worked:

    gatk --java-options "-Xmx12g -Xms12g" GenotypeGVCFs -R Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa -V gendb://my_database1AB -O global.vcf --new-qua
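
The manual editing step in this workaround can be replaced by a single awk command. This is a minimal sketch under two assumptions: the FASTA headers follow the Ensembl layout quoted earlier in the thread, and the interval file is meant to cover each contig end to end. It differs from the listing above in using a start of 0, because BED coordinates are 0-based with an exclusive end, so `1A 0 593586810` spans the whole contig whereas a start of 1 would leave out the first base. BED columns are also expected to be tab-separated. The file names `ref_headers.txt` and `intervals.bed` are hypothetical.

```shell
# Toy input; on real data: grep '^>' reference.fa > ref_headers.txt
cat > ref_headers.txt <<'EOF'
>1A dna:chromosome chromosome:WEWSeq_v.1.0:1A:1:593586810:1 REF
>1B dna:chromosome chromosome:WEWSeq_v.1.0:1B:1:690537804:1 REF
EOF

# Emit: contig <TAB> 0 <TAB> length (length is field 5 of the third word).
awk '/^>/ { split($3, f, ":"); printf "%s\t0\t%s\n", substr($1, 2), f[5] }' \
    ref_headers.txt > intervals.bed

cat intervals.bed
```

The resulting file would then take the place of `file.bed` in the `-L` argument of the GenomicsDBImport command.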

  • Pamela Bretscher

    Hi Alon Ziv,

    I'm glad you were able to find a workaround and thank you for posting your solution here for other researchers who may have a similar problem!

    Please let me know if you need anything else or have additional questions.

    Kind regards,

    Pamela

  • Alon Ziv

    Hi Pamela Bretscher,

    I don't have any additional questions at the moment.

    Again, thank you for your help!

    Alon

  • jianhui guo

    Change -O output.vcf.gz to -O output.vcf if any chromosome is longer than roughly 530 Mbp.
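
The "530M+" figure presumably refers to 2^29 = 536,870,912 bp, the largest coordinate a tabix (.tbi) index can address. A .vcf.gz output is normally written together with such an index. That connection is an assumption and is not confirmed anywhere in this thread. As a minimal sketch, the following lists contigs longer than that limit from a two-column table of names and lengths. The file names `contig_lengths.tsv` and `over_limit.txt` are hypothetical, and `toy1` is a made-up short contig added for contrast.

```shell
# Contig name and length, tab-separated. 1A is taken from this thread;
# "toy1" is a made-up short contig for contrast.
printf '1A\t593586810\ntoy1\t1000000\n' > contig_lengths.tsv

# Print the names of contigs longer than 2^29 = 536870912 bp.
awk -F'\t' -v limit=536870912 '$2 > limit { print $1 }' \
    contig_lengths.tsv > over_limit.txt

cat over_limit.txt
```

Every chromosome in the reference discussed here (593,586,810 bp and up) is above that value.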

     

Post is closed for comments.
