VariantRecalibrator IndexOutOfBoundsException
AnsweredHi GATK team and community,
I've encountered an IndexOutOfBoundsException error while trying to run VariantRecalibrator
GATK version used:
4.2.6.1
Exact command used:
gatk --java-options -Xms{java_mem}g \\
VariantRecalibrator \\
-V {sites_only_vcf['vcf.gz']} \\
-O {j.recalibration} \\
--tranches-file {j.tranches} \\
--trust-all-polymorphic \\
{tranche_cmdl} \\
{an_cmdl} \\
-mode SNP \\
{"--use-allele-specific-annotations" if use_as_annotations else ""} \\
--sample-every-Nth-variant {downsample_factor} \\
--output-model {j.model_file} \\
--max-gaussians {max_gaussians} \\
-resource:hapmap,known=false,training=true,truth=true,prior=15 {utils['hapmap_resource_vcf']} \\
-resource:omni,known=false,training=true,truth=true,prior=12 {utils['omni_resource_vcf']} \\
-resource:1000G,known=false,training=true,truth=false,prior=10 {utils['one_thousand_genomes_resource_vcf']} \\
-resource:dbsnp,known=true,training=false,truth=false,prior=7 {utils['dbsnp_resource_vcf']} \\
{f'-resource:singletons,known=true,training=true,truth=true,prior=10 {transmitted_singletons_resource_vcf.base}' if transmitted_singletons_resource_vcf else ''} \\
{f'-resource:singletons,known=true,training=true,truth=true,prior=10 {sibling_singletons_resource_vcf.base}' if sibling_singletons_resource_vcf else ''} \\
--rscript-file {j.snp_rscript}
"""
Entire error log
13:44:48.859 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so 13:44:48.884 INFO VariantRecalibrator - ------------------------------------------------------------ 13:44:48.884 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.2.6.1 13:44:48.884 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/ 13:44:48.884 INFO VariantRecalibrator - Executing as root@hostname-a257c5e4e5 on Linux v5.4.0-1042-gcp amd64 13:44:48.884 INFO VariantRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 13:44:48.884 INFO VariantRecalibrator - Start Date/Time: August 10, 2022 1:44:48 PM GMT 13:44:48.884 INFO VariantRecalibrator - ------------------------------------------------------------ 13:44:48.885 INFO VariantRecalibrator - ------------------------------------------------------------ 13:44:48.885 INFO VariantRecalibrator - HTSJDK Version: 2.24.1 13:44:48.885 INFO VariantRecalibrator - Picard Version: 2.27.1 13:44:48.885 INFO VariantRecalibrator - Built for Spark Version: 2.4.5 13:44:48.885 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2 13:44:48.885 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 13:44:48.885 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 13:44:48.885 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 13:44:48.885 INFO VariantRecalibrator - Deflater: IntelDeflater 13:44:48.886 INFO VariantRecalibrator - Inflater: IntelInflater 13:44:48.886 INFO VariantRecalibrator - GCS max retries/reopens: 20 13:44:48.886 INFO VariantRecalibrator - Requester pays: disabled 13:44:48.886 INFO VariantRecalibrator - Initializing engine 13:44:50.505 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz 13:44:52.638 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz 13:44:54.498 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz 13:44:56.548 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf 13:44:58.033 INFO FeatureManager - Using codec VCFCodec to read file file:///io/batch/005ed6/inputs/JqoVH/gnomad_genomes_v3.1_info.vcf.bgz 13:44:58.445 INFO VariantRecalibrator - Done initializing engine 13:44:58.448 INFO TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0 13:44:58.448 INFO TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0 13:44:58.448 INFO TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0 13:44:58.449 INFO TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q7.0 13:44:58.493 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "". Defaulting to VCF. 13:44:58.539 INFO ProgressMeter - Starting traversal 13:44:58.539 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute 13:44:58.567 INFO VariantRecalibrator - Shutting down engine [August 10, 2022 1:44:58 PM GMT] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.16 minutes. Runtime.totalMemory()=121422479360 org.broadinstitute.hellbender.exceptions.GATKException: Exception thrown at chr1:10007 [VC /io/batch/005ed6/inputs/JqoVH/gnomad_genomes_v3.1_info.vcf.bgz @ chr1:10007 Q. of type=SNP alleles=[T*, C] attr={AC=0, AC_raw=1, AS_FS=3.97940, AS_MQ=32.2452, AS_MQRankSum=0.358000, AS_QD=3.60000, AS_QUALapprox=|18, AS_ReadPosRankSum=1.23100, AS_SB_TABLE=[3, 0|1, 1], AS_VarDP=|5, AS_pab_max=1.00000, FS=3.97940, MQ=32.2452, MQRankSum=0.358000, QD=3.60000, QUALapprox=18, ReadPosRankSum=1.23100, SB=[3, 0, 1, 1], VarDP=5} GT=[] filters= at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:145) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) at org.broadinstitute.hellbender.engine.MultiVariantWalker.traverse(MultiVariantWalker.java:136) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Caused by: java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:4456) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantDataManager.decodeAnnotation(VariantDataManager.java:352) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantDataManager.decodeAnnotations(VariantDataManager.java:328) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.addDatum(VariantRecalibrator.java:602) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.addVariantDatum(VariantRecalibrator.java:578) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.lambda$consumeQueuedVariants$0(VariantRecalibrator.java:543) at java.util.ArrayList.forEach(ArrayList.java:1257) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.consumeQueuedVariants(VariantRecalibrator.java:543) at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.apply(VariantRecalibrator.java:522) at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:139) ... 20 more
Please assist
-
Hi GATK Team, I am checking in to see if someone has seen this issue?
-
Hi Lindo Nkambule,
Thank you so much for the reminder, I'm sorry that we didn't get to your post sooner.
It looks like this error is coming up because there is a mismatch between the VCF INFO field and the annotations at the variant site given in the error message (chr1:10007 in /io/batch/005ed6/inputs/JqoVH/gnomad_genomes_v3.1_info.vcf.bgz). How did you create or obtain or create this VCF? Do you know if it was created with GATK tools?
To solve this issue, I would first recommend running the ValidateVariants tool. This tool will make sure that the VCF is in the correct formatting. You can also take a look at that variant site (chr1:10007) and see if there is an issue with any of the annotations in the INFO field as compared to the header.
Please let me know how this goes and if you have any more questions.
Best,
Genevieve
-
I've tried using ValidateVariants but it doesn't output anything. This is what I get
14:17:33.276 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so 14:17:33.400 INFO ValidateVariants - ------------------------------------------------------------ 14:17:33.401 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.2.6.1 14:17:33.401 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/ 14:17:33.401 INFO ValidateVariants - Executing as root@hostname-21558eb86b on Linux v5.4.0-1042-gcp amd64 14:17:33.401 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 14:17:33.401 INFO ValidateVariants - Start Date/Time: August 19, 2022 2:17:33 PM GMT 14:17:33.401 INFO ValidateVariants - ------------------------------------------------------------ 14:17:33.401 INFO ValidateVariants - ------------------------------------------------------------ 14:17:33.401 INFO ValidateVariants - HTSJDK Version: 2.24.1 14:17:33.401 INFO ValidateVariants - Picard Version: 2.27.1 14:17:33.402 INFO ValidateVariants - Built for Spark Version: 2.4.5 14:17:33.402 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2 14:17:33.402 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 14:17:33.402 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 14:17:33.402 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 14:17:33.402 INFO ValidateVariants - Deflater: IntelDeflater 14:17:33.402 INFO ValidateVariants - Inflater: IntelInflater 14:17:33.402 INFO ValidateVariants - GCS max retries/reopens: 20 14:17:33.402 INFO ValidateVariants - Requester pays: disabled 14:17:33.402 INFO ValidateVariants - Initializing engine 14:17:35.049 INFO FeatureManager - Using codec VCFCodec to read file file:///io/batch/1679f0/inputs/B1Pqb/Homo_sapiens_assembly38.dbsnp138.vcf.gz 14:17:35.245 INFO FeatureManager - Using codec VCFCodec to read file file:///io/batch/1679f0/inputs/fwU01/gnomad_genomes_v3.1_info.vcf.bgz 14:17:35.675 INFO ValidateVariants - Done initializing engine 14:17:36.538 INFO ProgressMeter - Starting traversal 14:17:36.538 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms60g -jar /gatk/gatk-package-4.2.6.1-local.jar ValidateVariants -R gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -V /io/batch/1679f0/inputs/fwU01/gnomad_genomes_v3.1_info.vcf.bgz --dbsnp /io/batch/1679f0/inputs/B1Pqb/Homo_sapiens_assembly38.dbsnp138.vcf.gz
Please assist
-
Thanks Lindo Nkambule! Could you also let me know the answers to the other questions I posted?
-
The VCF was created by the gnomAD team and it's not yet public unfortunately, I am using it for testing a pipeline. I am not sure which software was used to generate the VCF but I'm almost certain that the allele-specific annotations were added using gnomAD functionality
-
Thanks Lindo Nkambule! Is it possible to share the first couple variant lines of the file? We would like to check the INFO field annotations. You can send via my Broad email, if that works for you.
-
Lindo Nkambule thanks for sending over the VCF snippet! Could you also provide the exact VariantRecalibrator command line used? The one you gave in the post has a few variables, it would be helpful to see what was actually run.
-
Sure! Below is the command:
gatk --java-options "-Xms118g -XX:+UseParallelGC -XX:ParallelGCThreads=14" \ VariantRecalibrator \ -V gs://hgdp-1kg/vqsr_pipeline/gnomad_genomes_v3.1_info.vcf.bgz \ -O ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/recalibration \ --tranches-file ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/tranches \ --trust-all-polymorphic \ -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \ -an AS_QD -an AS_MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_SOR -an AS_MQ \ -mode SNP \ --use-allele-specific-annotations \ --sample-every-Nth-variant 75 \ --output-model ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/model_file \ --max-gaussians 6 \ -resource:hapmap,known=false,training=true,truth=true,prior=15 gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz \ -resource:omni,known=false,training=true,truth=true,prior=12 gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz \ -resource:1000G,known=false,training=true,truth=false,prior=10 gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz \ -resource:dbsnp,known=true,training=false,truth=false,prior=7 gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ --rscript-file ${BATCH_TMPDIR}/VQSR__SNPsVariantRecalibratorCreateModel-Dx8eQ/snp_rscript
-
Thank you Lindo Nkambule! Could you check and see if the AS_SOR annotation is missing from all the records in the VCF?
You are specifying the annotation in your VariantRecalibrator command but it is most likely missing in your VCF (-an AS_SOR).
Potentially this annotation was not added upstream in your pipeline at one point. Do you know if that might be the case?
-
Hi Genevieve Brandt (she/her), that was the case. Running the command without -an AS_SOR doesn't give issues.
Thank you!
-
Thank you Lindo Nkambule!
-
Hi Lindo Nkambule,
I have a follow up question regarding how you are running VQSR. We noticed that you are using the argument --sample-every-Nth-variant 75 in your VariantRecalibrator command. With this parameter, you are throwing away around 98.7% of your data and then only using 16k training sites.
I am wondering if you are using this parameter just for training purposes, or for production as well? If it's for production, we might want to find a better way to optimize VariantRecalibrator because this parameter could lead to a model that is not trained well enough.
Let me know your thoughts about this. We want to make sure that VQSR can perform well for your use case!
Best regards,
Genevieve
-
Hi Lindo Nkambule,
Just wanted to follow up with you as well. To clarify Genevieve's comment, when setting that particular parameter to 75, that means you are only retaining 1 out of every 75 variants (and hence, are throwing away 74 / 75 = 98.7% of your data).
I was doing some digging about this parameter on Slack and happened to notice this JSON: https://github.com/broadinstitute/hail-ukbb-200k-callset/blob/21c5d2c084d75bb8706e7968ae8ac536c4b855ab/GenotypeAndFilter.AS.json It seems a previous UKBB callset was run with this parameter set to 75 as well, which led to only 16k training sites for VQSR being retained in that particular callset. I'm not sure what the corresponding number of training sites is for your current callset.
More generally, I'd like to understand what groups at the Broad might still be running VQSR with these parameters. If this is something you'd like to discuss, feel free to contact me via Slack or Broad email!
-
Hi Genevieve Brandt (she/her) and Samuel Lee
Apologies for the late response. My primary goal was to write this pipeline here in Hail Batch and the defaults are 75 for SNPs and 10 for INDELs, I haven't done much digging into the number of training sites and what the outputs look like. I was using the gnomAD VCF for testing purposes. I'm almost done with getting every step of the pipeline running, can we schedule a meeting to go through some of the results to get your inputs on the parameters and results? I should have everything working next week Wednesday latest.
Please sign in to leave a comment.
14 comments