Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

VariantRecalibrator IndexOutOfBoundsException

Answered
0

14 comments

  • Avatar
    Lindo Nkambule

    Hi GATK Team, I am checking in to see if someone has seen this issue?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Lindo Nkambule,

    Thank you so much for the reminder, I'm sorry that we didn't get to your post sooner. 

    It looks like this error is coming up because there is a mismatch between the VCF INFO field and the annotations at the variant site given in the error message (chr1:10007 in /io/batch/005ed6/inputs/JqoVH/gnomad_genomes_v3.1_info.vcf.bgz). How did you create or obtain or create this VCF? Do you know if it was created with GATK tools?

    To solve this issue, I would first recommend running the ValidateVariants tool. This tool will make sure that the VCF is in the correct formatting. You can also take a look at that variant site (chr1:10007) and see if there is an issue with any of the annotations in the INFO field as compared to the header.

    Please let me know how this goes and if you have any more questions.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Lindo Nkambule

    Hi Genevieve Brandt (she/her)

     

    I've tried using ValidateVariants but it doesn't output anything. This is what I get

    14:17:33.276 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    14:17:33.400 INFO  ValidateVariants - ------------------------------------------------------------
    14:17:33.401 INFO  ValidateVariants - The Genome Analysis Toolkit (GATK) v4.2.6.1
    14:17:33.401 INFO  ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    14:17:33.401 INFO  ValidateVariants - Executing as root@hostname-21558eb86b on Linux v5.4.0-1042-gcp amd64
    14:17:33.401 INFO  ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
    14:17:33.401 INFO  ValidateVariants - Start Date/Time: August 19, 2022 2:17:33 PM GMT
    14:17:33.401 INFO  ValidateVariants - ------------------------------------------------------------
    14:17:33.401 INFO  ValidateVariants - ------------------------------------------------------------
    14:17:33.401 INFO  ValidateVariants - HTSJDK Version: 2.24.1
    14:17:33.401 INFO  ValidateVariants - Picard Version: 2.27.1
    14:17:33.402 INFO  ValidateVariants - Built for Spark Version: 2.4.5
    14:17:33.402 INFO  ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    14:17:33.402 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    14:17:33.402 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    14:17:33.402 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    14:17:33.402 INFO  ValidateVariants - Deflater: IntelDeflater
    14:17:33.402 INFO  ValidateVariants - Inflater: IntelInflater
    14:17:33.402 INFO  ValidateVariants - GCS max retries/reopens: 20
    14:17:33.402 INFO  ValidateVariants - Requester pays: disabled
    14:17:33.402 INFO  ValidateVariants - Initializing engine
    14:17:35.049 INFO  FeatureManager - Using codec VCFCodec to read file file:///io/batch/1679f0/inputs/B1Pqb/Homo_sapiens_assembly38.dbsnp138.vcf.gz
    14:17:35.245 INFO  FeatureManager - Using codec VCFCodec to read file file:///io/batch/1679f0/inputs/fwU01/gnomad_genomes_v3.1_info.vcf.bgz
    14:17:35.675 INFO  ValidateVariants - Done initializing engine
    14:17:36.538 INFO  ProgressMeter - Starting traversal
    14:17:36.538 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms60g -jar /gatk/gatk-package-4.2.6.1-local.jar ValidateVariants -R gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -V /io/batch/1679f0/inputs/fwU01/gnomad_genomes_v3.1_info.vcf.bgz --dbsnp /io/batch/1679f0/inputs/B1Pqb/Homo_sapiens_assembly38.dbsnp138.vcf.gz

     

    Please assist

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Lindo Nkambule! Could you also let me know the answers to the other questions I posted?

    0
    Comment actions Permalink
  • Avatar
    Lindo Nkambule

    The VCF was created by the gnomAD team and it's not yet public unfortunately, I am using it for testing a pipeline. I am not sure which software was used to generate the VCF but I'm almost certain that the allele-specific annotations were added using gnomAD functionality 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Lindo Nkambule! Is it possible to share the first couple variant lines of the file? We would like to check the INFO field annotations. You can send via my Broad email, if that works for you. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Lindo Nkambule thanks for sending over the VCF snippet! Could you also provide the exact VariantRecalibrator command line used? The one you gave in the post has a few variables, it would be helpful to see what was actually run. 

    0
    Comment actions Permalink
  • Avatar
    Lindo Nkambule

    Sure! Below is the command:

    gatk --java-options "-Xms118g -XX:+UseParallelGC -XX:ParallelGCThreads=14" \
      VariantRecalibrator \
      -V gs://hgdp-1kg/vqsr_pipeline/gnomad_genomes_v3.1_info.vcf.bgz \
      -O ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/recalibration \
      --tranches-file ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/tranches \
      --trust-all-polymorphic \
      -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
      -an AS_QD -an AS_MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_SOR -an AS_MQ \
      -mode SNP \
      --use-allele-specific-annotations \
      --sample-every-Nth-variant 75 \
      --output-model ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/model_file \
      --max-gaussians 6 \
      -resource:hapmap,known=false,training=true,truth=true,prior=15 gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz \
      -resource:omni,known=false,training=true,truth=true,prior=12 gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz \
      -resource:1000G,known=false,training=true,truth=false,prior=10 gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
      -resource:dbsnp,known=true,training=false,truth=false,prior=7 gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
      --rscript-file ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/snp_rscript
    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you Lindo Nkambule! Could you check and see if the AS_SOR annotation is missing from all the records in the VCF?

    You are specifying the annotation in your VariantRecalibrator command but it is most likely missing in your VCF (-an AS_SOR).

    Potentially this annotation was not added upstream in your pipeline at one point. Do you know if that might be the case?

    1
    Comment actions Permalink
  • Avatar
    Lindo Nkambule

    Hi Genevieve Brandt (she/her), that was the case. Running the command without -an AS_SOR doesn't give issues.

    Thank you!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you Lindo Nkambule!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Lindo Nkambule,

    I have a follow up question regarding how you are running VQSR. We noticed that you are using the argument --sample-every-Nth-variant 75 in your VariantRecalibrator command. With this parameter, you are throwing away around 98.7% of your data and then only using 16k training sites. 

    I am wondering if you are using this parameter just for training purposes, or for production as well? If it's for production, we might want to find a better way to optimize VariantRecalibrator because this parameter could lead to a model that is not trained well enough. 

    Let me know your thoughts about this. We want to make sure that VQSR can perform well for your use case!

    Best regards,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Samuel Lee

    Hi Lindo Nkambule,

    Just wanted to follow up with you as well. To clarify Genevieve's comment, when setting that particular parameter to 75, that means you are only retaining 1 out of every 75 variants (and hence, are throwing away 74 / 75 = 98.7% of your data).

    I was doing some digging about this parameter on Slack and happened to notice this JSON: https://github.com/broadinstitute/hail-ukbb-200k-callset/blob/21c5d2c084d75bb8706e7968ae8ac536c4b855ab/GenotypeAndFilter.AS.json It seems a previous UKBB callset was run with this parameter set to 75 as well, which led to only 16k training sites for VQSR being retained in that particular callset. I'm not sure what the corresponding number of training sites is for your current callset.

    More generally, I'd like to understand what groups at the Broad might still be running VQSR with these parameters. If this is something you'd like to discuss, feel free to contact me via Slack or Broad email!

    0
    Comment actions Permalink
  • Avatar
    Lindo Nkambule

    Hi Genevieve Brandt (she/her) and Samuel Lee

    Apologies for the late response. My primary goal was to write this pipeline here in Hail Batch and the defaults are 75 for SNPs and 10 for INDELs, I haven't done much digging into the number of training sites and what the outputs look like. I was using the gnomAD VCF for testing purposes. I'm almost done with getting every step of the pipeline running, can we schedule a meeting to go through some of the results to get your inputs on the parameters and results? I should have everything working next week Wednesday latest.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk