Many GATK tools require sets of known variant sites to operate correctly. Each tool uses known sites differently, but the common purpose of these resources is to help distinguish true variants from false positives. Without them, the statistical analysis of the data is more exposed to artifacts and technical biases, which can in some cases dramatically reduce the sensitivity and reliability of the results.
Here are a few examples of how these resources are used by GATK tools:
In the data pre-processing pipeline, the Base Quality Score Recalibration (BQSR) tools use known variants to mask out positions where real variation is commonly found.
In the germline short variants pipeline, the Variant Quality Score Recalibration (VQSR) tools use different sets of known variants as training resources and truth sets to model true variation and artifacts. See the VQSR method documentation for more details on the different sets.
In the somatic short variants pipeline, the Contamination Estimation tools use known variants (with population allele frequencies if available) to estimate the proportion of cross-sample contamination in the case sample.
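To make the masking idea from the BQSR example concrete, here is a minimal Python sketch. It is a toy illustration, not how the BQSR tools are implemented: the function name, the observation format and the sample data are all assumptions made up for this example. It shows that when mismatches at known variant sites are skipped, real variation is not counted as sequencing error.

```python
# Toy illustration of masking known sites; NOT GATK's actual BQSR implementation.
# The function name and the observation format are hypothetical.
from typing import Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def empirical_mismatch_rate(
    observations: Iterable[Tuple[str, int, bool]],
    known_sites: Set[Site],
) -> float:
    """Return the fraction of observed bases that mismatch the reference,
    skipping every position listed in known_sites.

    Each observation is (chromosome, position, is_mismatch).
    """
    mismatches = 0
    total = 0
    for chrom, pos, is_mismatch in observations:
        if (chrom, pos) in known_sites:
            # Masked: a mismatch here is likely real variation, not an error.
            continue
        total += 1
        mismatches += int(is_mismatch)
    return mismatches / total if total else 0.0


# Made-up observations for illustration.
observations = [
    ("chr1", 100, False),
    ("chr1", 101, True),   # falls on a known site, so it is masked
    ("chr1", 102, False),
    ("chr1", 103, True),   # not a known site, so it counts as an error
    ("chr1", 104, False),
]
known = {("chr1", 101)}

masked_rate = empirical_mismatch_rate(observations, known)
unmasked_rate = empirical_mismatch_rate(observations, set())
```

In this toy data, masking lowers the apparent error rate because the mismatch at the known site is no longer attributed to the sequencer.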
We distinguish three main categories of known variants resources, as detailed below, associated with different validation standards. All variants resources are expected by the tools to be provided in VCF format unless otherwise specified.
In the strictest sense, a known variants resource is a list of variants that have been previously identified and reported, such as dbSNP. This typically does not imply any level of systematic curation or cross-validation. Tools that take such a resource do not assume that the variant calls are all true. If you do not have a known variants resource available for your organism of interest, you can usually bootstrap one from your own data.
In this context, bootstrapping refers to the process of generating a resource using a method that is less stringent than the normal method. For example, you can bootstrap a set of known variants by doing an initial round of variant calling without doing BQSR, and applying manual hard filters to the raw callset.
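The hard-filtering step of such a bootstrap can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the records are made up, only the INFO annotations QD and FS are checked, and the thresholds are placeholder values rather than recommendations from this document. In practice you would apply filters with a dedicated tool rather than by hand-parsing VCF lines.

```python
# Simplified sketch of hard-filtering a raw callset to bootstrap known sites.
# Helper names, thresholds and records are illustrative assumptions.
from typing import Dict, List


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column ("QD=25.1;FS=1.2;DB") into a dictionary.
    Flag entries without a value map to an empty string."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def passes_hard_filters(
    info: Dict[str, str], min_qd: float = 2.0, max_fs: float = 60.0
) -> bool:
    """Keep a record only if both annotations are present and within bounds."""
    try:
        qd = float(info["QD"])
        fs = float(info["FS"])
    except (KeyError, ValueError):
        return False  # be conservative: drop records we cannot assess
    return qd >= min_qd and fs <= max_fs


def bootstrap_known_sites(vcf_lines: List[str]) -> List[str]:
    """Return header lines plus the records that pass the hard filters."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # keep header lines untouched
            continue
        columns = line.rstrip("\n").split("\t")
        if passes_hard_filters(parse_info(columns[7])):  # column 8 is INFO
            kept.append(line)
    return kept


# Made-up raw calls for illustration.
raw_calls = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t812.5\t.\tQD=25.1;FS=1.2",
    "chr1\t200\t.\tC\tT\t31.0\t.\tQD=1.4;FS=3.0",    # fails the QD threshold
    "chr1\t300\t.\tG\tA\t455.0\t.\tQD=18.7;FS=75.9",  # fails the FS threshold
]
bootstrapped = bootstrap_known_sites(raw_calls)
```

The surviving records can then serve as the known sites for a first BQSR pass. Dropping records with missing annotations is a design choice of this sketch that favors a smaller, cleaner resource.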
A training set resource is a list of variants that is used by machine-learning based algorithms to model the properties of true variation vs. artifacts. This requires a higher standard of curation and validation of the variants that are included in the resource. Tools that take such a resource typically accept a parameter that indicates your degree of confidence in the resource. This type of resource is difficult to bootstrap, as it benefits greatly from orthogonal validation (e.g. through a different technology such as arrays or Sanger sequencing).
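As an illustration of what such a confidence parameter can mean, the sketch below assumes the confidence is given as a Phred-scaled prior, which is our understanding of how VQSR resource priors are expressed; this text does not confirm it, so treat it as an assumption. The function name and the resource labels are hypothetical.

```python
# Sketch: interpreting a Phred-scaled prior as a probability.
# Assumes the confidence parameter is Phred-scaled; names are hypothetical.


def phred_to_probability(prior_q: float) -> float:
    """Convert a Phred-scaled prior Q into the probability that a site
    in the resource is a true variant: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-prior_q / 10.0)


# Hypothetical resources with made-up priors.
resource_priors = {
    "well_validated_resource": 20.0,
    "moderately_validated_resource": 10.0,
}
resource_confidence = {
    name: phred_to_probability(q) for name, q in resource_priors.items()
}
```

Under this assumption a prior of Q10 corresponds to 90% confidence that a site in the resource is real, and Q20 to 99%, so a better-validated resource warrants a higher value.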
A truth set resource is a list of variants that is used to evaluate the quality of a variant callset (e.g. its sensitivity, also called recall, and its specificity). As such it requires the highest standard of validation, and tools that take such a resource will assume that all variant calls it contains are true variation. A truth set cannot be bootstrapped and must be generated using orthogonal validation methods.
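The following Python sketch shows one simple way a truth set can be used to score a callset. It is an illustration under stated assumptions, not the method of any particular GATK tool: the function name and the data are made up, variants are matched by exact (chromosome, position, ref, alt) identity, and calls absent from the truth set are counted as false positives, which only makes sense if the truth set is complete over the evaluated region. It reports sensitivity and precision; specificity is omitted because it would also require counting true negative sites.

```python
# Sketch of evaluating a callset against a truth set by exact matching.
# Function name and data are illustrative assumptions.
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def evaluate_callset(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Compare calls with a truth set whose entries are all assumed true."""
    true_positives = len(calls & truth)
    false_negatives = len(truth - calls)
    # Assumes the truth set is complete for the region being evaluated.
    false_positives = len(calls - truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(calls) if calls else 0.0
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Made-up variants for illustration.
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 75, "T", "C"),
}
callset = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr3", 10, "A", "T"),  # not in the truth set
}
metrics = evaluate_callset(callset, truth_set)
```

Because every entry in the truth set is treated as real, any error in the truth set translates directly into a misleading sensitivity estimate, which is why this type of resource demands the strictest validation.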