Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Insufficient variance error for VariantRecalibrator

5 comments

  • Pamela Bretscher

    Hi Matt Snyder,

    Thank you for your question and for checking similar forum posts for possible solutions. One workaround you could try is to increase --max-attempts when running VariantRecalibrator, in case the tool is failing due to a sampling error. You could also try reducing --max-gaussians even further (you may have seen in a similar previous post that a user specified --max-gaussians 1 and was successful).

    It may just be that the training data you're using simply doesn't include enough variation, as you are alluding to at the end of your post. I would recommend reading through this article which outlines all of the recommended data sources for use with VariantRecalibrator. 
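
    As a rough illustration of the two workarounds above, here is a minimal sketch of a VariantRecalibrator call with both options exposed as parameters. The reference, input, and resource file names, the resource label and prior, and the annotation list are all placeholders assumed for the example, not values from this thread.

    ```shell
    # Sketch only: every file name, the resource label/prior, and the
    # annotation list below are placeholders, not values from this thread.
    recalibrate() {
      local mode="$1" max_gaussians="$2" max_attempts="$3"
      gatk VariantRecalibrator \
        -R reference.fasta \
        -V input.vcf.gz \
        --resource:training,known=false,training=true,truth=true,prior=12.0 training_sites.vcf.gz \
        -an QD -an FS -an SOR -an ReadPosRankSum \
        -mode "$mode" \
        --max-gaussians "$max_gaussians" \
        --max-attempts "$max_attempts" \
        -O "output.${mode}.recal" \
        --tranches-file "output.${mode}.tranches"
    }
    ```

    For example, `recalibrate INDEL 1 4` would build an INDEL model with a single Gaussian and allow up to four attempts.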

    Kind regards,

    Pamela

  • Matt Snyder

    Hi Pamela,

    In the end it only works if I up the --max-attempts and drop the --max-gaussians. But it works! I am also running with "-mode BOTH", but the article you linked says "Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs". If I do so, can I just give the output and tranches files unique names for the variant types and then supply both to ApplyVQSR? E.g.:

    gatk ApplyVQSR \
    -R human_g1k_v37_decoy.fasta \
    -V input.vcf.gz \
    -O output.recalibrated.vcf.gz \
    --tranches-file output.SNP.tranches \
    --recal-file output.SNP.recal \
    --tranches-file output.indel.tranches \
    --recal-file output.indel.recal \
    -mode BOTH

    The article also mentions different values for --ts_filter_level should be used for SNPs and indels. How can I differentiate the --ts_filter_level for each variant type?

    Thanks!

  • Pamela Bretscher

    Hi Matt Snyder,

    I'm glad you were able to get it to work! As the VariantRecalibrator tool documentation notes, BOTH mode is intended for testing purposes and isn't recommended for actual variant analysis. I would therefore recommend following the steps in the article I linked previously and running the tool separately in SNP and INDEL modes. I hope this helps answer your question.
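
    To make the two-pass approach concrete, here is a minimal sketch that applies the SNP model first and then the INDEL model to the SNP-recalibrated output, so each pass can carry its own filter level. It assumes GATK4, where the filter level option is spelled --truth-sensitivity-filter-level. The intermediate file name and the filter levels in the usage comments are illustrative placeholders only.

    ```shell
    # Sketch only: the intermediate file name and the filter levels in the
    # usage comments are illustrative assumptions, not recommendations.
    apply_vqsr() {
      local mode="$1" in_vcf="$2" out_vcf="$3" level="$4"
      gatk ApplyVQSR \
        -R human_g1k_v37_decoy.fasta \
        -V "$in_vcf" \
        -O "$out_vcf" \
        --recal-file "output.${mode}.recal" \
        --tranches-file "output.${mode}.tranches" \
        --truth-sensitivity-filter-level "$level" \
        -mode "$mode"
    }

    # Usage: the SNP pass writes an intermediate VCF that the INDEL pass reads.
    # apply_vqsr SNP   input.vcf.gz            output.snp_recal.vcf.gz    99.5
    # apply_vqsr INDEL output.snp_recal.vcf.gz output.recalibrated.vcf.gz 99.0
    ```

    Each call takes exactly one recal file and one tranches file, which is why the two variant types are handled in separate, chained calls rather than in a single command.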

    Kind regards,

    Pamela

  • Matt Snyder

    Hi again Pamela,

    I noticed that this helpful article points out that VQSR works best on cohort VCFs with at least 30 samples. In our organization we do not create cohort VCFs. Each VCF is only for a single sample, most often for a targeted panel and sometimes for WES or WGS. Is VQSR even advisable in our scenario? When I run VQSR on SNPs and INDELs separately, I get this error for the INDEL call to VariantRecalibrator unless I drop the --max-gaussians to 1:

    A USER ERROR has occurred: Positive training model failed to converge. One or more annotations (usually MQ) may have insufficient variance. Please consider lowering the maximum number of Gaussians allowed for use in the model (via --max-gaussians 4, for example).

    I'm just a little worried about the stability of these analyses. Could we possibly encounter a VCF with so little variance that even "--max-gaussians 1" cannot get the analysis to run successfully?

    Thanks!

  • Pamela Bretscher

    Hi Matt Snyder,

    Yes, VQSR works best with a larger number of samples, as it relies on machine learning to build a model from a large number of variants. With only a single sample, it is very likely that you will keep running into issues with too little variance, and the results may be less accurate because the tool doesn't have much data to work with. I would recommend that you use hard-filtering instead for your use case. If you want, you can try to continue with VQSR and it may work without errors, but it likely isn't the most reliable approach for your data.
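
    For reference, a hard-filtering pass might look roughly like the sketch below, which splits a VCF by variant type and then flags records with VariantFiltration. The thresholds are generic starting points assumed for illustration and would need tuning for a targeted panel; the function names and file names are hypothetical.

    ```shell
    # Sketch only: thresholds are generic starting points assumed for
    # illustration; function and file names are hypothetical.
    select_type() {
      local type="$1" in_vcf="$2" out_vcf="$3"
      gatk SelectVariants -V "$in_vcf" -select-type "$type" -O "$out_vcf"
    }

    hard_filter_snps() {
      local in_vcf="$1" out_vcf="$2"
      gatk VariantFiltration \
        -V "$in_vcf" \
        -filter "QD < 2.0" --filter-name "QD2" \
        -filter "FS > 60.0" --filter-name "FS60" \
        -filter "MQ < 40.0" --filter-name "MQ40" \
        -O "$out_vcf"
    }

    hard_filter_indels() {
      local in_vcf="$1" out_vcf="$2"
      gatk VariantFiltration \
        -V "$in_vcf" \
        -filter "QD < 2.0" --filter-name "QD2" \
        -filter "FS > 200.0" --filter-name "FS200" \
        -O "$out_vcf"
    }

    # Usage:
    # select_type SNP   input.vcf.gz snps.vcf.gz   && hard_filter_snps   snps.vcf.gz   snps.filtered.vcf.gz
    # select_type INDEL input.vcf.gz indels.vcf.gz && hard_filter_indels indels.vcf.gz indels.filtered.vcf.gz
    ```

    Unlike VQSR, these fixed thresholds need no training model, so they don't depend on how much variance a single-sample VCF contains.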

    Kind regards,

    Pamela
