The problem:
To filter variants after the calling step, we prefer to use VQSR (Variant Quality Score Recalibration).
However, it requires well-curated training resources, which are typically not available for organisms other than humans. It also requires a large number of variant sites to operate properly, so it is not suitable for some small-scale experiments, such as targeted gene panels or exome studies with fewer than 30 exomes. For the latter, it is sometimes possible to pad your cohort with exomes from another study. This is especially true for humans -- where you can use 1000 Genomes or ExAC! -- but for non-human organisms it is often not possible to do this.
The solution: hard-filtering
If the previous statements apply to you, and you are sure that you cannot use VQSR, then you will need to use the VariantFiltration tool to hard-filter your variants. To do this, you will need to compose filter expressions using JEXL (as explained in the JEXL documentation) based on the generic filter recommendations detailed below. There is a tutorial that shows how to achieve this step by step. Be sure to also read the documentation explaining how to understand and improve upon the generic hard-filtering recommendations.
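Because SNPs and indels are evaluated against different thresholds (see the recommendations below), the usual first step is to split the callset by variant type. Here is a minimal sketch using SelectVariants; GATK3-style invocation is shown, file names are placeholders, and flag spellings differ somewhat in GATK4:

    # Split the raw callset so each variant type can get its own filters.
    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta -V raw_variants.vcf \
        -selectType SNP -o raw_snps.vcf

    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta -V raw_variants.vcf \
        -selectType INDEL -o raw_indels.vcf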
Caveats
There is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values depends very heavily on the specific callset: how it was called, what the data was like, and what organism it belongs to, among other things.
However, because we want to help, we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.
That said, you ABSOLUTELY SHOULD NOT expect to run these commands and be done with your analyses. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.
In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both the number of samples and the size of the targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking of the filtering parameters.
Filtering recommendations
Here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you. Be sure to read the documentation explaining how to understand and improve upon these recommendations.
Note that these JEXL expressions will tag as filtered any sites where the annotation value matches the expression. So if you use the expression QD < 2.0, any site with a QD lower than 2 will be tagged as failing that filter.
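To make the mechanics concrete, here is a minimal sketch of a VariantFiltration call with that single expression (GATK3-style invocation; file and filter names are placeholders you should adapt):

    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V input.vcf \
        --filterExpression "QD < 2.0" \
        --filterName "QD2" \
        -o output.vcf

Sites failing the expression get "QD2" written into their FILTER field, while passing sites are marked PASS. Nothing is removed from the file, so downstream tools can still see (and optionally ignore) the filtered records.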
For SNPs:
QD < 2.0
MQ < 40.0
FS > 60.0
SOR > 3.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
If your callset was generated with UnifiedGenotyper for legacy reasons, you can add HaplotypeScore > 13.0.
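Putting the SNP expressions together into one command, as a sketch (GATK3-style syntax, placeholder file names; giving each expression its own filter name lets you see exactly which test a site failed):

    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V raw_snps.vcf \
        --filterExpression "QD < 2.0" --filterName "QD2" \
        --filterExpression "MQ < 40.0" --filterName "MQ40" \
        --filterExpression "FS > 60.0" --filterName "FS60" \
        --filterExpression "SOR > 3.0" --filterName "SOR3" \
        --filterExpression "MQRankSum < -12.5" --filterName "MQRankSum-12.5" \
        --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum-8" \
        -o filtered_snps.vcf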
For indels:
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
SOR > 10.0
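And the indel counterpart, with the same caveats as above (remember to drop the InbreedingCoeff expression if you have fewer than 10 samples, as noted below):

    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V raw_indels.vcf \
        --filterExpression "QD < 2.0" --filterName "QD2" \
        --filterExpression "ReadPosRankSum < -20.0" --filterName "ReadPosRankSum-20" \
        --filterExpression "InbreedingCoeff < -0.8" --filterName "InbreedingCoeff-0.8" \
        --filterExpression "FS > 200.0" --filterName "FS200" \
        --filterExpression "SOR > 10.0" --filterName "SOR10" \
        -o filtered_indels.vcf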
And now some more IMPORTANT caveats (don't skip this!)
The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage data (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling, but in practice it is better modeled by a Gaussian distribution. Regardless, the DP threshold should be set at 5 or 6 sigma above the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
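As a concrete sketch of how you might derive that threshold (placeholder file names; VariantsToTable exports the per-site DP annotation, and the awk one-liner computes mean + 5 sigma):

    # Export per-site depth from the cohort VCF.
    java -jar GenomeAnalysisTK.jar -T VariantsToTable \
        -R reference.fasta -V cohort.vcf \
        -F DP --allowMissingData -o cohort.DP.table

    # Compute mean and standard deviation, then report the DP > X cutoff.
    awk 'NR > 1 && $1 != "NA" { s += $1; ss += $1 * $1; n++ }
         END { m = s / n; sd = sqrt(ss / n - m * m);
               printf "mean=%.1f sd=%.1f => filter: DP > %.0f\n", m, sd, m + 5 * sd }' cohort.DP.table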
Finally, a note of hope
Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.
HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works -- so really, whichever option you go with, you're going to suffer.
And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?
Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)
We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.
1 comment
I'd like to suggest a tool that makes it easier to visualize the INFO and FORMAT fields that get computed in an unfiltered VCF. All the work that VQSR does is "magically" show the impact of thresholds on precision-recall curves. VQSR uses its own terminology, which adds to the mystique when trying to explain it to other (a.k.a. normal) people. So the tool(s) should do the following, in order of importance:
1) Rip out all the VCF INFO and FORMAT statistics into everyone's favourite format (or two -- CSV?), something commonly readable by R/SAS/Excel/IGV as well; i.e. easy and ergonomic export of the VCF annotations. (See the sketch after this list.)
This might be all people need so they can do their own plots.
2) Make it easy to draw histograms and bivariate density plots of the data exported from step 1. This is purposefully close to how VQSR operates, but the visualization should be described in standard statistical terminology. People can then start to understand how their own data is distributed and where they would set their hard filters. Reuse an existing tool like GGobi or ggplot? Or build your own with a domain-appropriate user interface.
3) For optional extra credit, add a slider-bar tool to plot precision/recall curves given a known-good dataset. Again very close to VQSR in thinking, but "tilted" toward standard terminology to make it as approachable as any other data analysis.
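For what it's worth, step 1 is largely covered by GATK's existing VariantsToTable, and step 2 can be prototyped in a couple of lines. A sketch, with placeholder file names and assuming R with the ggplot2 package is installed:

    # 1) Export annotations to a tab-separated table readable by R/Excel/etc.
    java -jar GenomeAnalysisTK.jar -T VariantsToTable \
        -R reference.fasta -V unfiltered.vcf \
        -F CHROM -F POS -F TYPE -F QD -F FS -F MQ -F SOR \
        -F MQRankSum -F ReadPosRankSum \
        --allowMissingData -o annotations.table

    # 2) Quick histogram of one annotation (here QD) with ggplot2.
    Rscript -e 'library(ggplot2);
                d <- read.delim("annotations.table");
                p <- ggplot(d, aes(x = QD)) + geom_histogram(bins = 50);
                ggsave("QD.hist.png", p)'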