New version of GATK leads to VariantRecalibrator error.
Answered
This post is a reminder for other GATK users and the GATK team. The "The provided reference alleles do not appear to represent the same position" error most likely occurred for the same reason as the one described at https://github.com/broadinstitute/gatk/issues/6701
===================
Detail
I followed the germline variant calling best practices. The VariantRecalibrator SNP model ran normally, but an error occurred in the VariantRecalibrator INDEL model. I re-ran the VariantRecalibrator INDEL model command in the following two situations separately, and both finished normally:
1. Run without the dbsnp resource
2. Downgrade GATK from 4.1.9.0 to 4.1.4.0
===================
If you are seeing an error, please provide (REQUIRED):
a) GATK version used: 4.1.9.0
b) Exact command used:
~/bin/gatk-4.1.9.0/gatk --java-options -Xms24g VariantRecalibrator -V temp/vartiant_germline/sites.only.vcf.gz -O temp/vartiant_germline/recaliberation.indel.vcf --tranches-file temp/vartiant_germline/tranches.indel.txt --trust-all-polymorphic -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 -an DP -an FS -an MQRankSum -an QD -an ReadPosRankSum -an SOR -mode INDEL --max-gaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12 ~/db/mutect2_support/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2 ~/db/mutect2_support/b37/hg19_v0_dbsnp_138.b37.vcf.gz -resource:axiomPoly,known=false,training=true,truth=false,prior=10 ~/db/mutect2_support/b37/Axiom_Exome_Plus.genotypes.all_populations.poly.b37.vcf.gz --use-allele-specific-annotations
c) Entire error log:
Using GATK jar ~/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms24g -jar ~/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar VariantRecalibrator -V temp/vatiant_germline/sites.only.vcf.gz -O temp/vatiant_germline/recaliberation.indel.vcf --tranches-file temp/vatiant_germline/tranches.indel.txt --trust-all-polymorphic -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 -an DP -an FS -an MQRankSum -an QD -an ReadPosRankSum -an SOR -mode INDEL --max-gaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12 ~/db/mutect2_support/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2 ~/db/mutect2_support/b37/hg19_v0_dbsnp_138.b37.vcf.gz --use-allele-specific-annotations -resource:axiomPoly,known=false,training=true,truth=false,prior=10 ~/db/mutect2_support/b37/Axiom_Exome_Plus.genotypes.all_populations.poly.b37.vcf.gz
14:58:10.389 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:~/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 12, 2020 2:58:10 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:58:10.555 INFO VariantRecalibrator - ------------------------------------------------------------
14:58:10.555 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.9.0
14:58:10.555 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
14:58:10.555 INFO VariantRecalibrator - Executing as y@c001 on Linux v3.10.0-957.el7.x86_64 amd64
14:58:10.555 INFO VariantRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
14:58:10.556 INFO VariantRecalibrator - Start Date/Time: November 12, 2020 2:58:10 PM CST
14:58:10.556 INFO VariantRecalibrator - ------------------------------------------------------------
14:58:10.556 INFO VariantRecalibrator - ------------------------------------------------------------
14:58:10.556 INFO VariantRecalibrator - HTSJDK Version: 2.23.0
14:58:10.556 INFO VariantRecalibrator - Picard Version: 2.23.3
14:58:10.556 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:58:10.556 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:58:10.556 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:58:10.556 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:58:10.556 INFO VariantRecalibrator - Deflater: IntelDeflater
14:58:10.556 INFO VariantRecalibrator - Inflater: IntelInflater
14:58:10.556 INFO VariantRecalibrator - GCS max retries/reopens: 20
14:58:10.556 INFO VariantRecalibrator - Requester pays: disabled
14:58:10.557 INFO VariantRecalibrator - Initializing engine
14:58:10.823 INFO FeatureManager - Using codec VCFCodec to read file file://~/db/mutect2_support/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz
14:58:10.963 INFO FeatureManager - Using codec VCFCodec to read file file://~/db/mutect2_support/b37/hg19_v0_dbsnp_138.b37.vcf.gz
14:58:11.067 INFO FeatureManager - Using codec VCFCodec to read file file://~/db/mutect2_support/b37/Axiom_Exome_Plus.genotypes.all_populations.poly.b37.vcf.gz
14:58:11.090 INFO FeatureManager - Using codec VCFCodec to read file file://~/projects/test2/temp/vatiant_germline/sites.only.vcf.gz
14:58:11.139 INFO VariantRecalibrator - Done initializing engine
14:58:11.142 INFO TrainingSet - Found mills track: Known = false Training = true Truth = true Prior = Q12.0
14:58:11.142 INFO TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
14:58:11.142 INFO TrainingSet - Found axiomPoly track: Known = false Training = true Truth = false Prior = Q10.0
14:58:11.167 INFO ProgressMeter - Starting traversal
14:58:11.168 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
14:58:21.182 INFO ProgressMeter - 2:23974966 0.2 22000 131828.6
14:58:31.703 INFO ProgressMeter - 3:171904490 0.3 46000 134404.7
14:58:41.753 INFO ProgressMeter - 6:18264210 0.5 67000 131441.3
14:58:42.144 INFO VariantRecalibrator - Shutting down engine
[November 12, 2020 2:58:42 PM CST] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.53 minutes.
Runtime.totalMemory()=29244260352
java.lang.IllegalStateException: The provided reference alleles do not appear to represent the same position, AC* vs. AA*
at org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils.determineReferenceAllele(GATKVariantContextUtils.java:209)
at org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils.isAlleleInList(GATKVariantContextUtils.java:164)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantDataManager.doAllelesMatch(VariantDataManager.java:424)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantDataManager.parseTrainingSets(VariantDataManager.java:399)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.addDatum(VariantRecalibrator.java:614)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.addVariantDatum(VariantRecalibrator.java:577)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.lambda$consumeQueuedVariants$0(VariantRecalibrator.java:542)
at java.util.ArrayList.forEach(ArrayList.java:1251)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.consumeQueuedVariants(VariantRecalibrator.java:542)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.apply(VariantRecalibrator.java:521)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:120)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.traverse(MultiVariantWalker.java:118)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
-
Hi woodword,
In the issue you linked to, the problem was not a GATK bug but a new check in GATK that revealed an existing problem in the user's data. The fix was to add more information to the error message so that users can find such problems more easily.
I can update the issue with your example and determine if the fix was successful, since your error message does not have location information. Did this issue persist when you tried GATK 4.1.4.0?
Please double check your data because this is not a GATK issue, but an issue with your reference alleles being inconsistent, which indicates using inconsistent reference versions.
-
I am not sure where the problem is. I've tried every version of GATK (4.1.4.0 to 4.1.9.0) with the same data and resource files. The issue disappeared when I ran 4.1.4.0 and 4.1.4.1; GATK 4.1.5.0 or newer leads to this problem.
By the way, the dbsnp resource (dbsnp_138.b37.vcf.gz) I used has an MD5 of fb24e974627684d6a7e455a450a4d405, so I hope I didn't download the wrong file.
-
Hi woodword, yes, you are correct, there was a new check introduced in GATK 4.1.5.0 that throws an error when there are issues with the reference file. I have created an issue ticket here so that we can improve the error message. The improved error message will help you find the location of the issue so you can fix your file and run the tool.
-
Just wanted to follow up that we have merged a change to improve the error message and the fix will be in the next release.
-
Hi,
I hit the same issue using VariantRecalibrator from GATK release 4.1.9.0, so I upgraded to 4.2.0.0 to check the position. It turns out that it was at position 6:29857105, with the following error description:
Caused by: java.lang.IllegalStateException: The provided reference alleles do not appear to represent the same position, AC* vs. AA
Now, checking the dbsnp_138.b37.vcf.gz file for that position gave:
6 29857105 rs201835144 A C . . OTHERKG;RS=201835144;RSPOS=29857105;SAO=0;SSR=0;VC=SNV;VLD;VP=0x050000000001040002000100;WGT=1;dbSNPBuildID=137
6 29857105 rs9278395 AA A,AC . . GNO;NOC;OTHERKG;RS=9278395;RSPOS=29857106;SAO=0;SLO;SSR=0;VC=DIV;VP=0x050100000001000102000210;WGT=1;dbSNPBuildID=118
6 29857105 rs202000432 AC A . . OTHERKG;RS=202000432;RSPOS=29857111;SAO=0;SSR=0;VC=DIV;VP=0x050000000001000002000200;WGT=1;dbSNPBuildID=137
and checking my own vcf for that same position gave:
6 29857105 . AC A 138.92 . AC=2;AF=1.00;AN=2;AS_BaseQRankSum=.;AS_FS=0.000;AS_MQ=22.28;AS_MQRankSum=.;AS_QD=27.80;AS_ReadPosRankSum=.;AS_SOR=3.611;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=22.95;QD=27.78;SOR=3.611 GT:AD:DP:GQ:PL 1/1:0,5:5:15:153,15,0
Please, could you help me to see what is going wrong?
Thank you very much.
Regards.
Ahmed
-
Hi Ahmed,
Thanks for giving this example. It looks like there is a problem in this dbSNP file that breaks the reference context: the 2nd and 3rd records conflict, since both start at 6:29857105 but list different reference alleles (AA vs. AC).
You may want to use a newer dbSNP version to fix this issue.
Best,
Genevieve
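For anyone who wants to look for other sites like this in a resource file, here is a minimal Python sketch. It assumes the VCF is sorted by position and treats a site as conflicting when two records share CHROM and POS but neither REF allele is a prefix of the other (as with AA vs. AC above). The function names are hypothetical, and this is only an approximation of the situation described in this thread, not GATK's own check.

```python
import gzip
from itertools import combinations


def _has_conflict(refs):
    """True if any two REF alleles disagree on the bases they share."""
    return any(
        not (a.startswith(b) or b.startswith(a))
        for a, b in combinations(set(refs), 2)
    )


def find_ref_conflicts(lines):
    """Yield ((chrom, pos), sorted REF alleles) for each conflicting site.

    Assumes records for the same site are adjacent (position-sorted VCF)
    and that fields are tab-separated, as in a standard VCF.
    """
    current_site, refs = None, []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref = line.split("\t", 4)[:4]
        site = (chrom, int(pos))
        if site != current_site:
            if current_site is not None and _has_conflict(refs):
                yield current_site, sorted(set(refs))
            current_site, refs = site, []
        refs.append(ref.upper())
    # Do not forget the last site in the file.
    if current_site is not None and _has_conflict(refs):
        yield current_site, sorted(set(refs))


def scan_vcf(path):
    """Print every conflicting site found in a (possibly gzipped) VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for (chrom, pos), refs in find_ref_conflicts(handle):
            print(f"{chrom}:{pos}\t{','.join(refs)}")
```

Calling scan_vcf on the dbSNP file used in this thread should, under the assumptions above, list 6:29857105 among its hits; I have not run it on the full file.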
-
Hi Genevieve,
Thank you very much for your reply. Please, do you have any suggestions where to find a newer version of dbSNP for build37.2?
All the best.
Ahmed
-
I'm not sure, do you know where you got this version of dbSNP? Is it in our data resources?
-
Absolutely, I downloaded it from the Broad Institute FTP bundle. The Google bucket is exclusively for hg38.
-
OK, I see. dbSNP versions are not always curated for these kinds of issues, so there is not a lot we can do about this. The GATK tool can't handle these sites because the entries conflict.
I'll look into this from our end, but I can't guarantee that we will be able to provide a fix because this is a dbSNP resource, not a resource that we made.
woodword did you ever find a workaround for this?
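One possible workaround, offered only as an untested sketch and not as something confirmed by the GATK team: write a copy of the dbSNP resource that leaves out every record at a site whose REF alleles conflict, then use that copy as the dbsnp resource. The Python below assumes a position-sorted, tab-separated VCF and uses the same prefix rule as the example in this thread (AA vs. AC conflict, A vs. AA do not). The function names are hypothetical.

```python
import gzip
from itertools import combinations


def refs_conflict(refs):
    """True if any two REF alleles disagree on the bases they share."""
    return any(
        not (a.startswith(b) or b.startswith(a))
        for a, b in combinations(set(refs), 2)
    )


def drop_conflicting_sites(lines):
    """Yield VCF lines, skipping all records at sites with conflicting REFs.

    Header lines pass through unchanged. Records for one site are buffered
    and only emitted if their REF alleles are mutually consistent.
    """
    site, buffered = None, []
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        if not line.strip():
            continue  # ignore blank lines
        chrom, pos, _id, ref = line.split("\t", 4)[:4]
        key = (chrom, pos)
        if key != site:
            if not refs_conflict(r for _, r in buffered):
                yield from (rec for rec, _ in buffered)
            site, buffered = key, []
        buffered.append((line, ref.upper()))
    # Flush the final site.
    if not refs_conflict(r for _, r in buffered):
        yield from (rec for rec, _ in buffered)


def filter_vcf(src, dst):
    """Read a (possibly gzipped) VCF and write an uncompressed filtered copy."""
    opener = gzip.open if src.endswith(".gz") else open
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(drop_conflicting_sites(fin))
```

The filtered file is written uncompressed and would still need to be indexed before GATK can read it as a resource. Dropping whole sites is a blunt choice: it does not decide which record is right, it only removes the ones that disagree.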