VariantRecalibrator – GATK

Build a recalibration model to score variant quality for filtering purposes

Category Variant Filtering

Overview

This tool performs the first pass in a two-stage process called Variant Quality Score Recalibration (VQSR). Specifically, it builds the model that will be used in the second step to actually filter variants. This model attempts to describe the relationship between variant annotations (such as QD, MQ and ReadPosRankSum) and the probability that a variant is a true genetic variant versus a sequencing or data processing artifact. It is developed adaptively based on "true sites" provided as input, typically HapMap sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array (in humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The result is a score called the VQSLOD that gets added to the INFO field of each variant. This score is the log odds of being a true variant versus being false under the trained Gaussian mixture model.
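To make the "log odds under a Gaussian mixture model" idea concrete, here is a minimal Python sketch. It is not the tool's implementation: it assumes diagonal covariances, uses natural logarithms, and leaves out the per-resource prior. The function names and the two toy models are hypothetical.

```python
import math


def log_gaussian_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_mixture_density(point, components):
    """Log density of `point` under a diagonal-covariance Gaussian mixture.

    `components` is a list of (weight, means, variances) tuples, with one
    mean and one variance per annotation. This layout is an assumption
    made for illustration only.
    """
    logs = []
    for weight, means, variances in components:
        log_p = math.log(weight)
        for x, m, v in zip(point, means, variances):
            log_p += log_gaussian_pdf(x, m, v)
        logs.append(log_p)
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


def log_odds(point, positive_model, negative_model):
    """Log odds of the 'true variant' model versus the 'artifact' model."""
    return (log_mixture_density(point, positive_model)
            - log_mixture_density(point, negative_model))


# Toy models over two annotations (think QD and MQ); values are made up.
positive_model = [(1.0, [20.0, 60.0], [25.0, 4.0])]
negative_model = [(1.0, [5.0, 40.0], [25.0, 100.0])]

good_call = log_odds([20.0, 60.0], positive_model, negative_model)
bad_call = log_odds([5.0, 40.0], positive_model, negative_model)
```

A call whose annotations sit near the positive model gets a positive score, and one that resembles the negative model gets a negative score.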

Summary of the VQSR procedure

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. These probabilities can then be used to filter the variants with a greater level of accuracy and flexibility than can typically be achieved by traditional hard filtering (filtering on individual annotation value thresholds). The first pass consists of building a model that describes how variant annotation values co-vary with the truthfulness of variant calls in a training set, and then scoring all input variants according to the model. The second pass simply consists of specifying a target sensitivity value (which corresponds to an empirical VQSLOD cutoff) and applying filters to each variant call according to its ranking. The result is a VCF file in which variants have been assigned a score and filter status.
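The link between a target sensitivity and an empirical VQSLOD cutoff can be sketched as follows. This is an illustrative Python sketch under simple assumptions (scores at truth sites are already available, ties are ignored), not the tool's actual procedure. The function names and the "FILTERED" label are hypothetical.

```python
import math


def vqslod_cutoff(truth_scores, target_sensitivity):
    """Lowest score to keep so that at least `target_sensitivity` percent
    of the truth-site variants pass."""
    if not truth_scores:
        raise ValueError("need at least one truth-site score")
    ranked = sorted(truth_scores, reverse=True)
    n_keep = math.ceil(len(ranked) * target_sensitivity / 100.0)
    n_keep = max(1, min(n_keep, len(ranked)))
    return ranked[n_keep - 1]


def apply_filter(scores, cutoff):
    """Label each call by comparing its score with the cutoff."""
    return ["PASS" if score >= cutoff else "FILTERED" for score in scores]


# Made-up scores for ten truth-site variants.
truth_scores = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
cutoff_90 = vqslod_cutoff(truth_scores, 90.0)
labels = apply_filter([7.0, 1.5, 2.0], cutoff_90)
```

Lowering the target sensitivity raises the cutoff, which filters more calls in exchange for higher specificity.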

VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.

Inputs

The input variants to be recalibrated. These variant calls must be annotated with the annotations that will be used for modeling. If the calls come from multiple samples, they must have been obtained by joint calling the samples, either directly (running HaplotypeCaller on all samples together) or via the GVCF workflow (HaplotypeCaller with -ERC GVCF per-sample, then GenotypeGVCFs on the resulting GVCFs), which is more scalable.
Known, truth, and training sets to be used by the algorithm. See the method documentation linked above for more details.

Outputs

A recalibration table file that will be used by the ApplyVQSR tool.
A tranches file that shows various metrics of the recalibration callset for slices of the data.

Usage example

Recalibrating SNPs in exome data

 gatk VariantRecalibrator \
   -R Homo_sapiens_assembly38.fasta \
   -V input.vcf.gz \
   --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz \
   --resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \
   --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
   --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
   -mode SNP \
   -O output.recal \
   --tranches-file output.tranches \
   --rscript-file output.plots.R

Allele-specific version of the SNP recalibration (beta)

 gatk VariantRecalibrator \
   -R Homo_sapiens_assembly38.fasta \
   -V input.vcf.gz \
   -AS \
   --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz \
   --resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \
   --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
   --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
   -mode SNP \
   -O output.AS.recal \
   --tranches-file output.AS.tranches \
   --rscript-file output.plots.AS.R

Note that to use the allele-specific (AS) mode, the input VCF must have been produced using allele-specific annotations in HaplotypeCaller. Note also that each allele will have a separate line in the output recalibration file with its own VQSLOD and `culprit`, which will be transferred to the final VCF by the ApplyVQSR tool.

Caveats

The values used in the examples above are only meant to show how the command lines are composed. They are not meant to be taken as specific recommendations of values to use in your own work, and they may be different from the values cited elsewhere in our documentation. For the latest recommendations on how to set parameter values for your own analyses, please read the Best Practices section of the documentation, especially the FAQ document on VQSR parameters.
Whole genomes and exomes take slightly different parameters, so make sure you adapt your commands accordingly! See the documents linked above for details.
If you work with small datasets (e.g. targeted capture experiments or a small number of exomes), you will run into problems. Read the docs linked above for advice on how to deal with those issues.
In order to create the model reporting plots, the Rscript executable needs to be in your environment PATH (this is the scripting version of R, not the interactive version). See http://www.r-project.org for more information on how to download and install R.

Additional notes

This tool accepts only a single input variant file, unlike earlier versions of GATK, which accepted multiple input variant files.
SNPs and indels must be recalibrated in separate runs, but it is not necessary to separate them into different files. See the tutorial linked above for an example workflow. Note that mixed records are treated as indels.

Additional Information

Read filters

This Read Filter is automatically applied to the data by the Engine before processing by VariantRecalibrator.

WellformedReadFilter

VariantRecalibrator specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s)	Default value	Summary
Required Arguments
--output -O	null	The output recal file used by ApplyVQSR
--resource	[]	A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
--tranches-file	null	The output tranches file used by ApplyVQSR
--use-annotation -an	[]	The names of the annotations which should be used for calculations
--variant -V	[]	One or more VCF files containing variants
Optional Tool Arguments
--aggregate	[]	Additional raw input variants to be used in building the model
--arguments_file	[]	read one or more arguments files and add them to the command line
--cloud-index-prefetch-buffer -CIPB	-1	Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
--cloud-prefetch-buffer -CPB	40	Size of the cloud-only prefetch buffer (in MB; 0 to disable).
--disable-bam-index-caching -DBIC	false	If true, don't cache BAM indexes. This will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
--disable-sequence-dictionary-validation	false	If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
--gcs-max-retries -gcs-retries	20	If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays	""	Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
--help -h	false	display the help message
--ignore-all-filters	false	If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.
--ignore-filter	[]	If specified, the variant recalibrator will also use variants marked as filtered by the specified filter name in the input VCF file
--input-model	null	If specified, the variant recalibrator will read the VQSR model from this file path.
--interval-merging-rule -imr	ALL	Interval merging rule for abutting intervals
--intervals -L	[]	One or more genomic intervals over which to operate
--mode	SNP	Recalibration mode to employ
--output-model	null	If specified, the variant recalibrator will output the VQSR model to this file path.
--reference -R	null	Reference sequence
--rscript-file	null	The output rscript file generated by the VQSR to aid in visualization of the input data and learned model
--sites-only-vcf-output	false	If true, don't emit genotype fields when writing vcf file output.
--target-titv -titv	2.15	The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures. (approx 2.15 for whole genome experiments). ONLY USED FOR PLOTTING PURPOSES!
--truth-sensitivity-tranche -tranche	[100.0, 99.9, 99.0, 90.0]	The levels of truth sensitivity at which to slice the data. (in percent, that is 1.0 for 1 percent)
--use-allele-specific-annotations -AS	false	If specified, the variant recalibrator will attempt to use the allele-specific versions of the specified annotations.
--version	false	display the version number for this tool
Optional Common Arguments
--add-output-sam-program-record	true	If true, adds a PG tag to created SAM/BAM/CRAM files.
--add-output-vcf-command-line	true	If true, adds a command line header line to created VCF files.
--create-output-bam-index -OBI	true	If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
--create-output-bam-md5 -OBM	false	If true, create an MD5 digest for any BAM/SAM/CRAM file created
--create-output-variant-index -OVI	true	If true, create a VCF index when writing a coordinate-sorted VCF file.
--create-output-variant-md5 -OVM	false	If true, create an MD5 digest for any VCF file created.
--disable-read-filter -DF	[]	Read filters to be disabled before analysis
--disable-tool-default-read-filters	false	Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
--exclude-intervals -XL	[]	One or more genomic intervals to exclude from processing
--gatk-config-file	null	A configuration file to use with the GATK.
--input -I	[]	BAM/SAM/CRAM file containing reads
--interval-exclusion-padding -ixp	0	Amount of padding (in bp) to add to each interval you are excluding.
--interval-padding -ip	0	Amount of padding (in bp) to add to each interval you are including.
--interval-set-rule -isr	UNION	Set merging approach to use for combining interval inputs
--lenient -LE	false	Lenient processing of VCF files
--QUIET	false	Whether to suppress job-summary info on System.err.
--read-filter -RF	[]	Read filters to be applied before analysis
--read-index	[]	Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
--read-validation-stringency -VS	SILENT	Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--seconds-between-progress-updates	10.0	Output traversal statistics every time this many seconds elapse
--sequence-dictionary	null	Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
--tmp-dir	null	Temp directory to use.
--use-jdk-deflater -jdk-deflater	false	Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater -jdk-inflater	false	Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity	INFO	Control verbosity of logging.
Advanced Arguments
--bad-lod-score-cutoff -bad-lod-cutoff	-5.0	LOD score cutoff for selecting bad variants
--dirichlet	0.001	The dirichlet parameter in the variational Bayes algorithm.
--k-means-iterations	100	Number of k-means iterations
--max-attempts	1	Number of attempts to build a model before failing
--max-gaussians	8	Max number of Gaussians for the positive model
--max-iterations	150	Maximum number of VBEM iterations
--max-negative-gaussians	2	Max number of Gaussians for the negative model
--maximum-training-variants	2500000	Maximum number of training data
--minimum-bad-variants	1000	Minimum number of bad variants
--mq-cap-for-logit-jitter-transform -mq-cap	0	Apply logit transform and jitter to MQ values
--prior-counts	20.0	The number of prior counts to use in the variational Bayes algorithm.
--showHidden	false	display hidden arguments
--shrinkage	1.0	The shrinkage parameter in the variational Bayes algorithm.
--standard-deviation-threshold -std	10.0	Annotation value divergence threshold (number of standard deviations from the means)
--trust-all-polymorphic	false	Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--add-output-sam-program-record / -add-output-sam-program-record

If true, adds a PG tag to created SAM/BAM/CRAM files.

boolean true

--add-output-vcf-command-line / -add-output-vcf-command-line

If true, adds a command line header line to created VCF files.

boolean true

--aggregate / -aggregate

Additional raw input variants to be used in building the model
These additional calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling.

List[FeatureInput[VariantContext]] []

--arguments_file / NA

read one or more arguments files and add them to the command line

List[File] []

--bad-lod-score-cutoff / -bad-lod-cutoff

LOD score cutoff for selecting bad variants
Variants scoring lower than this threshold will be used to build the Gaussian model of bad variants.

double -5.0 [ [ -∞ ∞ ] ]

--cloud-index-prefetch-buffer / -CIPB

Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.

int -1 [ [ -∞ ∞ ] ]

--cloud-prefetch-buffer / -CPB

Size of the cloud-only prefetch buffer (in MB; 0 to disable).

int 40 [ [ -∞ ∞ ] ]

--create-output-bam-index / -OBI

If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.

boolean true

--create-output-bam-md5 / -OBM

If true, create an MD5 digest for any BAM/SAM/CRAM file created

boolean false

--create-output-variant-index / -OVI

If true, create a VCF index when writing a coordinate-sorted VCF file.

boolean true

--create-output-variant-md5 / -OVM

If true, create an MD5 digest for any VCF file created.

boolean false

--dirichlet / NA

The dirichlet parameter in the variational Bayes algorithm.

double 0.001 [ [ -∞ ∞ ] ]

--disable-bam-index-caching / -DBIC

If true, don't cache BAM indexes. This will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.

boolean false

--disable-read-filter / -DF

Read filters to be disabled before analysis

List[String] []

--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation

If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!

boolean false

--disable-tool-default-read-filters / -disable-tool-default-read-filters

Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)

boolean false

--exclude-intervals / -XL

One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).

List[String] []

--gatk-config-file / NA

A configuration file to use with the GATK.

String null

--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int 20 [ [ -∞ ∞ ] ]

--gcs-project-for-requester-pays / NA

Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.

String ""

--help / -h

display the help message

boolean false

--ignore-all-filters / NA

If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.

boolean false

--ignore-filter / NA

If specified, the variant recalibrator will also use variants marked as filtered by the specified filter name in the input VCF file
For this to work properly, the --ignore-filter argument should also be applied to the ApplyVQSR command.

List[String] []

--input / -I

BAM/SAM/CRAM file containing reads

List[String] []

--input-model / NA

If specified, the variant recalibrator will read the VQSR model from this file path.
The filename for a VQSR model fit to use to recalibrate the input variants. This model should be generated using a previous VariantRecalibrator run with the --output-model argument.

String null

--interval-exclusion-padding / -ixp

Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int 0 [ [ -∞ ∞ ] ]

--interval-merging-rule / -imr

Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not actually overlap) into a single continuous interval. However you can change this behavior if you want them to be treated as separate intervals instead.

The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:

ALL
OVERLAPPING_ONLY

IntervalMergingRule ALL

--interval-padding / -ip

Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int 0 [ [ -∞ ∞ ] ]
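The padding arithmetic in the example above ('-L 1:100' with a padding of 20 becoming '1:80-120') can be sketched in Python. This is only an illustration: the helper names are hypothetical, only the contig:pos and contig:start-end forms are parsed, and the padded end is not clipped to the contig length because that would require the sequence dictionary.

```python
def parse_interval(text):
    """Parse 'contig:pos' or 'contig:start-end' into (contig, start, end)."""
    contig, _, span = text.partition(":")
    if "-" in span:
        start_text, end_text = span.split("-", 1)
    else:
        start_text = end_text = span
    return contig, int(start_text), int(end_text)


def pad_interval(text, padding):
    """Extend an interval by `padding` bases on each side (1-based coordinates)."""
    contig, start, end = parse_interval(text)
    padded_start = max(1, start - padding)  # never go below position 1
    return f"{contig}:{padded_start}-{end + padding}"


padded = pad_interval("1:100", 20)
```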

--interval-set-rule / -isr

Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will always be merged using UNION). Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.

The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:

UNION: Take the union of all intervals
INTERSECTION: Take the intersection of intervals (the subset that overlaps all intervals specified)

IntervalSetRule UNION
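The combination rules described for this argument can be illustrated with plain Python sets of positions on a single contig. This is a toy model with hypothetical helper names, not how the engine represents intervals: the included sets are combined with the chosen rule, the excluded sets are always combined by union, and the result is the difference.

```python
def positions(start, end):
    """All 1-based positions in the closed interval [start, end]."""
    return set(range(start, end + 1))


def combine(include_sets, exclude_sets, rule="UNION"):
    """Combine -L style sets with `rule`, then subtract the -XL style sets."""
    if rule == "UNION":
        included = set().union(*include_sets)
    elif rule == "INTERSECTION":
        included = set.intersection(*include_sets) if include_sets else set()
    else:
        raise ValueError(f"unknown rule: {rule}")
    excluded = set().union(*exclude_sets)  # exclusions are always unioned
    return included - excluded


first = positions(1, 10)
second = positions(5, 15)
masked = positions(8, 9)

union_result = combine([first, second], [masked], rule="UNION")
intersection_result = combine([first, second], [masked], rule="INTERSECTION")
```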

--intervals / -L

One or more genomic intervals over which to operate

List[String] []

--k-means-iterations / NA

Number of k-means iterations
This parameter determines the number of k-means iterations to perform in order to initialize the means of the Gaussians in the Gaussian mixture model.

int 100 [ [ -∞ ∞ ] ]

--lenient / -LE

Lenient processing of VCF files

boolean false

--max-attempts / NA

Number of attempts to build a model before failing
The statistical model being built by this tool may fail due to simple statistical sampling issues. Rather than dying immediately when the initial model fails, this argument allows the tool to restart with a different random seed and try to build the model again. The first successfully built model will be kept. Note that the most common underlying cause of model building failure is that there is insufficient data to build a really robust model. This argument provides a workaround for that issue but it is preferable to provide this tool with more data (typically by including more samples or more territory) in order to generate a more robust model.

int 1 [ [ -∞ ∞ ] ]

--max-gaussians / NA

Max number of Gaussians for the positive model
This parameter determines the maximum number of Gaussians that should be used when building a positive model using the variational Bayes algorithm.

int 8 [ [ -∞ ∞ ] ]

--max-iterations / NA

Maximum number of VBEM iterations
This parameter determines the maximum number of VBEM iterations to be performed in the variational Bayes algorithm. The procedure will normally end when convergence is detected.

int 150 [ [ -∞ ∞ ] ]

--max-negative-gaussians / NA

Max number of Gaussians for the negative model
This parameter determines the maximum number of Gaussians that should be used when building a negative model using the variational Bayes algorithm. The actual maximum used is the smaller of the --max-gaussians and --max-negative-gaussians values, meaning that if --max-gaussians is smaller than --max-negative-gaussians, the --max-gaussians value will be used for both. Note that this number should be small (e.g. 4) to achieve the best results.

int 2 [ [ -∞ ∞ ] ]

--maximum-training-variants / NA

Maximum number of training data
The number of variants to use in building the Gaussian mixture model. Training sets larger than this will be randomly downsampled.

int 2500000 [ [ -∞ ∞ ] ]
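Random downsampling of an oversized training set can be sketched as below. This is a minimal Python illustration with a hypothetical function name and a fixed seed chosen for reproducibility. How the tool itself draws the sample is not specified here.

```python
import random


def downsample(training, maximum, seed=0):
    """Return at most `maximum` items, sampled without replacement."""
    items = list(training)
    if len(items) <= maximum:
        return items  # small sets are used as-is
    rng = random.Random(seed)
    return rng.sample(items, maximum)


subset = downsample(range(1000), 100)
```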

--minimum-bad-variants / NA

Minimum number of bad variants
This parameter determines the minimum number of variants that will be selected from the list of worst scoring variants to use for building the Gaussian mixture model of bad variants.

int 1000 [ [ -∞ ∞ ] ]

--mode / -mode

Recalibration mode to employ
Use either SNP for recalibrating only SNPs (emitting indels untouched in the output VCF) or INDEL for indels (emitting SNPs untouched in the output VCF). There is also a BOTH option for recalibrating both SNPs and indels simultaneously, but this is meant for testing purposes only and should not be used in actual analyses.

The --mode argument is an enumerated type (Mode), which can have one of the following values:

SNP
INDEL
BOTH

Mode SNP

--mq-cap-for-logit-jitter-transform / -mq-cap

Apply logit transform and jitter to MQ values
MQ is capped at a "max" value (60 for bwa-mem) when the alignment is considered perfect. Typically, a huge proportion of the reads in a dataset are perfectly mapped, which yields a distribution of MQ values with a blob below the max value and a huge peak at the max value. This does not conform to the expectations of the Gaussian mixture model of VQSR and has been observed to yield a ROC curve with a jump. This argument aims to mitigate this problem. Using MQCap = X has two effects: (1) MQ values are transformed by a scaled logit on [0,X] (+ epsilon to avoid division by zero) to make the blob more Gaussian-like, and (2) the transformed values at MQ=X are jittered to break the peak into a narrow Gaussian. Beware that IndelRealigner, if used, adds 10 to MQ for successfully realigned indels. We recommend either using --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or setting MQCap to max+10 to take that into account. If this option is not used, or if MQCap is set to 0, MQ will not be transformed.

int 0 [ [ -∞ ∞ ] ]
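A rough Python sketch of the two effects described above (scaled logit, then jitter at the cap) follows. The exact formula and constants used by the tool are not given in this document, so the epsilon offset, the jitter width, the clipping of out-of-range values and the function names are all assumptions made for illustration.

```python
import math
import random


def logit_transform(x, x_min, x_max):
    """Scaled logit of x on the open interval (x_min, x_max)."""
    return math.log((x - x_min) / (x_max - x))


def transform_mq(mq, mq_cap, rng, epsilon=1.0, jitter_sigma=0.01):
    """Transform an MQ value; a cap of 0 (or less) leaves MQ untouched."""
    if mq_cap <= 0:
        return mq
    clipped = min(max(mq, 0.0), float(mq_cap))  # assumed: keep within [0, cap]
    # epsilon widens the interval so that 0 and the cap stay finite
    value = logit_transform(clipped, -epsilon, mq_cap + epsilon)
    if clipped >= mq_cap:
        # assumed jitter: small Gaussian noise spreads the peak at the cap
        value += rng.gauss(0.0, jitter_sigma)
    return value


rng = random.Random(42)
untouched = transform_mq(30.0, 0, rng)
midpoint = transform_mq(30.0, 60, rng)
peak_a = transform_mq(60.0, 60, rng)
peak_b = transform_mq(60.0, 60, rng)
```

Values below the cap map deterministically, while repeated values at the cap land at slightly different points, which is what breaks up the peak.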

--output / -O

The output recal file used by ApplyVQSR

R String null

--output-model / NA

If specified, the variant recalibrator will output the VQSR model to this file path.
This GATKReport gives information to describe the VQSR model fit. Normalized means for the positive model are concatenated as one table and negative model normalized means as another table. Covariances are also concatenated for positive and negative models, respectively. Tables of annotation means and standard deviations are provided to help describe the normalization. The model fit report can be read in with our R gsalib package. Individual model Gaussians can be subset by the value in the "Gaussian" column if desired.

String null

--prior-counts / NA

The number of prior counts to use in the variational Bayes algorithm.

double 20.0 [ [ -∞ ∞ ] ]

--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean false

--read-filter / -RF

Read filters to be applied before analysis

List[String] []

--read-index / -read-index

Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.

List[String] []

--read-validation-stringency / -VS

Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency SILENT

--reference / -R

Reference sequence

String null

--resource / -resource

A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
Any set of VCF files to use as lists of training, truth, or known sites.
Training: The program builds the Gaussian mixture model using input variants that overlap with these training sites.
Truth: The program uses these truth sites to determine where to set the cutoff in VQSLOD sensitivity.
Known: The program only uses known sites for reporting purposes (to indicate whether variants are already known or novel). They are not used in any calculations by the algorithm itself.
Bad: A database of known bad variants can be used to supplement the set of worst ranked variants (compared to the Gaussian mixture model) that the program selects from the data to model "bad" variants.

R List[FeatureInput[VariantContext]] []

--rscript-file / NA

The output rscript file generated by the VQSR to aid in visualization of the input data and learned model

String null

--seconds-between-progress-updates / -seconds-between-progress-updates

Output traversal statistics every time this many seconds elapse

double 10.0 [ [ -∞ ∞ ] ]

--sequence-dictionary / -sequence-dictionary

Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.

String null

--showHidden / -showHidden

display hidden arguments

boolean false

--shrinkage / NA

The shrinkage parameter in the variational Bayes algorithm.

double 1.0 [ [ -∞ ∞ ] ]

--sites-only-vcf-output / NA

If true, don't emit genotype fields when writing vcf file output.

boolean false

--standard-deviation-threshold / -std

Annotation value divergence threshold (number of standard deviations from the means)
If a variant has annotations more than -std standard deviations away from the mean, it won't be used for building the Gaussian mixture model.

double 10.0 [ [ -∞ ∞ ] ]
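The divergence check can be sketched as a per-annotation z-score test. This Python sketch assumes that the mean and standard deviation are computed over all supplied rows. The function name and toy data are hypothetical, and the tool's own normalization may differ.

```python
import statistics


def within_threshold(rows, threshold):
    """For each row of annotation values, report whether every annotation
    lies within `threshold` standard deviations of its column mean."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    stdevs = [statistics.pstdev(column) for column in columns]
    flags = []
    for row in rows:
        ok = all(
            sd == 0 or abs(value - mean) <= threshold * sd
            for value, mean, sd in zip(row, means, stdevs)
        )
        flags.append(ok)
    return flags


# Nine ordinary variants and one extreme outlier, one annotation each.
rows = [[0.0]] * 9 + [[100.0]]
strict = within_threshold(rows, 2.0)
default = within_threshold(rows, 10.0)
```

With a threshold of 2 the outlier (3 standard deviations from the mean) is excluded. With the default of 10 it is kept.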

--target-titv / -titv

The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures. (approx 2.15 for whole genome experiments). ONLY USED FOR PLOTTING PURPOSES!
The expected transition / transversion ratio of true novel variants in your targeted region (whole genome, exome, specific genes), which varies greatly by the CpG and GC content of the region. See the expected Ti/Tv ratios section of the GATK best practices documentation (https://software.broadinstitute.org/gatk/guide/best-practices) for more information. Typical values are 2.15 for human whole genomes and 3.2 for human whole exomes. Note that this parameter is used for display purposes only and isn't used anywhere in the algorithm!

double 2.15 [ [ -∞ ∞ ] ]
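For reference, a transition / transversion ratio can be computed from a list of SNPs as in the Python sketch below. It assumes biallelic SNPs given as upper-case (ref, alt) pairs with ref different from alt. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Transitions stay within purines (A<->G) or within pyrimidines (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
ratio = titv_ratio(example_snps)
```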

--tmp-dir / NA

Temp directory to use.

String null

--tranches-file / NA

The output tranches file used by ApplyVQSR

R String null

--trust-all-polymorphic / NA

Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

boolean false

--truth-sensitivity-tranche / -tranche

The levels of truth sensitivity at which to slice the data. (in percent, that is 1.0 for 1 percent)
Add truth sensitivity slices through the call set at the given values. The default values are 100.0, 99.9, 99.0, and 90.0 which will result in 4 estimated tranches in the final call set: the full set of calls (100% sensitivity at the accessible sites in the truth set), a 99.9% truth sensitivity tranche, along with progressively smaller tranches at 99% and 90%. Note: You must pass in each tranche as a separate value (e.g. -tranche 100.0 -tranche 99.9).

List[Double] [100.0, 99.9, 99.0, 90.0]

--use-allele-specific-annotations / -AS

If specified, the variant recalibrator will attempt to use the allele-specific versions of the specified annotations.
Generate a VQSR model using per-allele data instead of the default per-site data, assuming that the input VCF contains allele-specific annotations. Annotations should be specified using their full names with AS_ prefix. Non-allele-specific (scalar) annotations will be applied to all alleles.

boolean false

--use-annotation / -an

The names of the annotations which should be used for calculations
See the input VCF file's INFO field for a list of all available annotations.

R List[String] []

--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean false

--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean false

--variant / -V

One or more VCF files containing variants

R List[String] []

--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel INFO

--version / NA

display the version number for this tool

boolean false

GATK version 4.1.1.0 built at Sat, 23 Nov 2019 17:29:19 -0500.
