Insufficient variance error for VariantRecalibrator
Hi im getting an error trying to test the VariantRecalibrator tool on a human sample.
I'm using gatk v4.1.8.1.
Here is my command with paths edited for privacy ;)
gatk VariantRecalibrator \
-R /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/human_g1k_v37_decoy.fasta \
-V /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/test_vcfs/vcf/PC-DS-376301.hg19.hc.vcf.gz \
--resource:goldStandard,known=false,training=true,truth=false,prior=10.0 /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
--resource:1000G,known=false,training=true,truth=true,prior=10.0 /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/1000G_phase3_v4_20130502.sites.vcf.gz \
--resource:dbSNP,known=true,training=false,truth=false,prior=2.0 /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/dbsnp_138.b37.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode BOTH \
-O PC-DS-376301.hg19.hc.recal \
--tranches-file PC-DS-376301.hg19.hc.tranches
Here is the error log.
Using GATK jar /home/linuxbrew/.linuxbrew/Cellar/gatk/4.1.8.1/gatk-package-4.1.8.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/linuxbrew/.linuxbrew/Cellar/gatk/4.1.8.1/gatk-package-4.1.8.1-local.jar VariantRecalibrator -R /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/human_g1k_v37_decoy.fasta -V /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/test_vcfs/vcf/PC-DS-376301.hg19.hc.vcf.gz --resource:goldStandard,known=false,training=true,truth=false,prior=10.0 /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/Mills_and_1000G_gold_standard.indels.b37.vcf.gz --resource:1000G,known=false,training=true,truth=true,prior=10.0 /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/1000G_phase3_v4_20130502.sites.vcf.gz --resource:dbSNP,known=true,training=false,truth=false,prior=2.0 /qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/dbsnp_138.b37.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode BOTH -O PC-DS-376301.hg19.hc.recal --tranches-file PC-DS-376301.hg19.hc.tranches
16:29:10.991 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/linuxbrew/.linuxbrew/Cellar/gatk/4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 22, 2021 4:29:12 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:29:12.094 INFO VariantRecalibrator - ------------------------------------------------------------
16:29:12.095 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.8.1
16:29:12.095 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
16:29:12.095 INFO VariantRecalibrator - Executing as msnyder5@seapl0app024 on Linux v3.8.13-118.2.2.el6uek.x86_64 amd64
16:29:12.095 INFO VariantRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-b08
16:29:12.095 INFO VariantRecalibrator - Start Date/Time: October 22, 2021 4:29:10 PM PDT
16:29:12.095 INFO VariantRecalibrator - ------------------------------------------------------------
16:29:12.095 INFO VariantRecalibrator - ------------------------------------------------------------
16:29:12.096 INFO VariantRecalibrator - HTSJDK Version: 2.23.0
16:29:12.096 INFO VariantRecalibrator - Picard Version: 2.22.8
16:29:12.096 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:29:12.096 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:29:12.096 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:29:12.096 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:29:12.096 INFO VariantRecalibrator - Deflater: IntelDeflater
16:29:12.096 INFO VariantRecalibrator - Inflater: IntelInflater
16:29:12.097 INFO VariantRecalibrator - GCS max retries/reopens: 20
16:29:12.097 INFO VariantRecalibrator - Requester pays: disabled
16:29:12.097 INFO VariantRecalibrator - Initializing engine
16:29:12.509 INFO FeatureManager - Using codec VCFCodec to read file file:///qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
16:29:12.645 INFO FeatureManager - Using codec VCFCodec to read file file:///qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/1000G_phase3_v4_20130502.sites.vcf.gz
16:29:12.789 INFO FeatureManager - Using codec VCFCodec to read file file:///qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/wdl/resources/dbs/dbsnp_138.b37.vcf.gz
16:29:12.924 INFO FeatureManager - Using codec VCFCodec to read file file:///qual_control/users/msnyder5/DNAnexus/GATK_DNAseq/test_vcfs/vcf/PC-DS-376301.hg19.hc.vcf.gz
16:29:13.003 INFO VariantRecalibrator - Done initializing engine
16:29:13.008 INFO TrainingSet - Found goldStandard track: Known = false Training = true Truth = false Prior = Q10.0
16:29:13.008 INFO TrainingSet - Found 1000G track: Known = false Training = true Truth = true Prior = Q10.0
16:29:13.009 INFO TrainingSet - Found dbSNP track: Known = true Training = false Truth = false Prior = Q2.0
16:29:13.016 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "recal". Defaulting to VCF.
16:29:13.048 INFO ProgressMeter - Starting traversal
16:29:13.048 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
16:29:23.955 INFO ProgressMeter - unmapped 0.2 757 4164.3
16:29:23.956 INFO ProgressMeter - Traversal complete. Processed 757 total variants in 0.2 minutes.
16:29:23.956 INFO VariantDataManager - QD: mean = 17.08 standard deviation = 8.67
16:29:23.956 INFO VariantDataManager - MQ: mean = 59.93 standard deviation = 0.63
16:29:23.957 INFO VariantDataManager - MQRankSum: mean = -0.37 standard deviation = 3.34
16:29:23.957 INFO VariantDataManager - ReadPosRankSum: mean = 0.09 standard deviation = 1.34
16:29:23.957 INFO VariantDataManager - FS: mean = 0.48 standard deviation = 1.51
16:29:23.958 INFO VariantDataManager - SOR: mean = 0.75 standard deviation = 0.38
16:29:23.964 INFO VariantDataManager - Annotation order is: [MQ, QD, FS, SOR, MQRankSum, ReadPosRankSum]
16:29:23.964 INFO VariantDataManager - Training with 717 variants after standard deviation thresholding.
16:29:23.964 WARN VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
16:29:23.966 INFO GaussianMixtureModel - Initializing model with 100 k-means iterations...
16:29:24.067 INFO VariantRecalibratorEngine - Finished iteration 0.
16:29:24.119 INFO VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.62748
16:29:24.142 INFO VariantRecalibratorEngine - Convergence after 9 iterations!
16:29:24.151 WARN VariantRecalibratorEngine - Model could not pre-compute denominators. Denominator for gaussian evaluation cannot be computed. Covariance determinant is -4.492440700156467E-57. One or more annotations (usually MQ) may have insufficient variance.
16:29:24.405 INFO VariantRecalibrator - Shutting down engine
[October 22, 2021 4:29:24 PM PDT] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.22 minutes.
Runtime.totalMemory()=3291480064
***********************************************************************
A USER ERROR has occurred: Positive training model failed to converge. One or more annotations (usually MQ) may have insufficient variance. Please consider lowering the maximum number of Gaussians allowed for use in the model (via --max-gaussians 4, for example).
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
I get the same error even after I include "--max-gaussians 4". I also noticed on the FAQ page that it says not to do this unless "working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort." This is a human sample.
This website says that the HpMap project was decommissioned. It also says: "The number of novel variants is constantly increasing and many believe that the 1000 Genomes Project could potentially overshadow the utility of HapMap." Do you think it is a good idea to use the 1000G phase 3 variants as truth variants? Could doing so somehow be the source of my error? It seems that maybe the annotations in my --resource files are not different enough? You may have noticed all my vcf resources come from the gatk bundle37.
I checked the "Similar posts" and none are helpful in resolving this.
Thanks!
-
Hi Matt Snyder,
Thank you for your question and for checking similar forum posts for possible solutions. One workaround you could try is to increase the number of --max-attempts when running VariantRecalibrator, in case the tool is failing due to a sampling error. You could also try reducing the --max-gaussians even further (you may have seen in a similar previous post that a user specified --max-gaussians 1 and was successful).
It may just be that the training data you're using simply doesn't include enough variation, as you are alluding to at the end of your post. I would recommend reading through this article which outlines all of the recommended data sources for use with VariantRecalibrator.
Kind regards,
Pamela
-
Hi Pamela,
In the end it only works if I up the --max-attempts and drop the --max-gaussians. But it works! I am also running with "-mode BOTH", but the article you linked says "Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs". If I do so, can I just give the output and tranches files unique names for the variant types and then supply both to ApplyVQSR. E.g.:
gatk ApplyVQSR \
-R human_g1k_v37_decoy.fasta \
-V input.vcf.gz \
-O output.recalibrated.vcf.gz \
--tranches-file output.SNP.tranches \ --recal-file output.SNP.recal \
--tranches-file output.indel.tranches \
--recal-file output.indel.recal \
-mode BOTHThe article also mentions different values for --ts_filter_level should be used for SNPs and indels. How can I differentiate the --ts_filter_level for each variant type?
Thanks!
-
Hi Matt Snyder,
I'm glad you were able to get it to work! As mentioned in the VariantRecalibrator tool documentation, the BOTH mode isn't recommended to be used in variant analysis but rather for testing purposes. Therefore, I would recommend following the steps outlined in the previous article I linked to run separately using SNP and INDEL modes. I hope this helps answer your question.
Kind regards,
Pamela
-
Hi again Pamela,
I noticed that this helpful article points out that VQSR works best on cohort VCFs with at least 30 samples. In our organization we do not create cohort VCFs. Each VCF is only for a single sample, most often for a targeted panel and sometimes for WES or WGS. Is VQSR even advisable in our scenario? When I run VQSR on SNPs and INDELs separately, I get this error for the INDEL call to VariantRecalibrator unless I drop the --max-gaussians to 1:
A USER ERROR has occurred: Positive training model failed to converge. One or more annotations (usually MQ) may have insufficient variance. Please consider lowering the maximum number of Gaussians allowed for use in the model (via --max-gaussians 4, for example).
I'm just a little worried about the stability of these analyses. Could we possibly encounter a VCF with so little variance that even "--max-gaussians 1" cannot get the analysis to runs successfully?
Thanks!
-
Hi Matt Snyder,
Yes, VQSR works best with a larger number of samples as it relies on machine learning to build a model from a large number of variants. I would say it is very likely that you will continue running into issues with having too little variance if you are using only a single sample and the results may be less accurate given that the tool doesn't have a lot of data to use. I would recommend that you use hard-filtering instead for your use case. If you want, you can try to continue with VQSR and it may work without errors, but it likely isn't the most reliable for your data.
Kind regards,
Pamela
Please sign in to leave a comment.
5 comments