Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Bug in VariantRecalibrator: Data not found


4 comments

  • Official comment
    Laura Gauthier

    The part of your log that caught my attention is the

    Model could not pre-compute denominators.

    I believe this happens when the covariance matrix is not invertible, usually because the variance of one of the annotations is near zero. The MQ standard deviation is admittedly not zero, but it is quite small in proportion to the mean.
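A minimal sketch of how one might screen for this before training, assuming the annotation values have already been pulled out of the VCF into plain Python lists. The function name `flag_low_variance`, the 1% relative threshold, and the example values are all illustrative assumptions; none of them come from GATK or from the log in this thread.

```python
import statistics

def flag_low_variance(annotations, rel_threshold=0.01):
    """Return the names of annotations whose spread is tiny relative to their mean.

    `annotations` maps an annotation name to a list of numeric values.
    `rel_threshold` is an arbitrary cutoff chosen for illustration only.
    """
    flagged = []
    for name, values in annotations.items():
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values)
        # Use the standard deviation relative to the mean; if the mean is
        # zero, fall back to the absolute standard deviation.
        relative_spread = stdev / abs(mean) if mean != 0 else stdev
        if relative_spread < rel_threshold:
            flagged.append(name)
    return flagged

# Made-up values: QD varies widely, MQ is pinned near 60.
example = {
    "QD": [2.1, 14.8, 25.3, 31.0, 9.7],
    "MQ": [60.0, 60.0, 59.98, 60.0, 60.0],
}
print(flag_low_variance(example))
```

An annotation flagged this way contributes an almost constant dimension to the model, which is the kind of input that can make a covariance matrix numerically singular.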

    You have two paths forward:

    1. Try removing the MQ annotation.  If you're really concerned about bad MQ variants, you can do some supplemental hard filtering.

    2. Try the new Variant Extract-Train-Score (VETS) pipeline for variant filtration: https://github.com/broadinstitute/gatk/blob/master/scripts/vcf_site_level_filtering_wdl/JointVcfFiltering.wdl. That pipeline defaults to an outlier detection model that's similar to a random forest and far more robust to numerical instability.

    We're moving towards phasing out VQSR in the best practices in favor of the new VETS pipeline.
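The first path (dropping MQ) could look roughly like the sketch below, which only assembles an argument list and does not run anything. The file names and the annotation list are placeholders, since the original command is not shown in this thread, and the required `--resource` arguments are deliberately left out and would need to be added for a real run.

```python
def build_vqsr_command(annotations, exclude=("MQ",)):
    """Assemble a VariantRecalibrator argument list without the excluded annotations.

    All paths below are hypothetical placeholders. The --resource arguments
    that the tool requires are omitted here and must be supplied separately.
    """
    # Exact-name match, so "MQRankSum" is kept when only "MQ" is excluded.
    kept = [name for name in annotations if name not in exclude]
    cmd = [
        "gatk", "VariantRecalibrator",
        "-R", "reference.fasta",
        "-V", "cohort.vcf.gz",
        "-mode", "SNP",
        "-O", "cohort.snps.recal",
        "--tranches-file", "cohort.snps.tranches",
    ]
    for name in kept:
        cmd += ["-an", name]
    return cmd

# Assumed annotation set for illustration.
cmd = build_vqsr_command(["QD", "MQ", "MQRankSum", "ReadPosRankSum", "FS", "SOR"])
print(" ".join(cmd))
```

Variants with poor mapping quality would then be handled by a separate hard filter, as suggested in point 1.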

     

     

  • Gökalp Çelik

    Hi G E

    Have you tried the latest GATK 4.4 for this workflow? The changes since your version may include a fix for this issue.

    You could also try reducing the number of Gaussians, or removing that setting completely, to see whether your analysis completes without issues.

    I hope this helps. 

  • G E

    I just tested GATK 4.4 - it gives the same error.

    Reducing --max-gaussians is not a fix, because according to the specs of this tool it should work on 3 whole genomes.

    So this is a bug. Can you advise on next steps?

    Thanks.

  • Gökalp Çelik

    Hi G E

    I will ask our team and try to find a better solution to your problem.

    In the meantime, you may want to check our documentation for VQSR. According to our best practices documentation, reducing the number of Gaussians is one way to overcome this problem.

    https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR- 

    https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering 

    The --max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the --max-gaussians value.
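The "try decrementing" advice can be sketched as a simple retry loop. This is an illustration only: `run_once` is a hypothetical callable that would launch VariantRecalibrator with `--max-gaussians k` and report whether it succeeded, and the starting value of 8 is an assumption rather than something stated in this thread.

```python
def run_with_decreasing_gaussians(run_once, start=8, minimum=1):
    """Try run_once(k) for k = start, start - 1, ..., minimum.

    Returns the first k for which run_once(k) is truthy, or None if every
    attempt fails. `run_once` is a caller-supplied, hypothetical function.
    """
    for k in range(start, minimum - 1, -1):
        if run_once(k):
            return k
    return None

# Stand-in for a real run: pretend the model only fits with 4 or fewer Gaussians.
def fake_run(k):
    return k <= 4

print(run_with_decreasing_gaussians(fake_run))
```

In practice `run_once` might wrap `subprocess.run` and return `True` when the exit code is 0. Note that this is a workaround for getting a model to fit; it does not settle whether the failure on 3 whole genomes is a bug.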