Apply a score cutoff to filter variants based on a recalibration table
Category: Variant Filtering
Overview
This tool performs the second pass in a two-stage process called Variant Quality Score Recalibration (VQSR). Specifically, it applies filtering to the input variants based on the recalibration table produced in the first step by VariantRecalibrator and a target sensitivity value, which the tool matches internally to a VQSLOD score cutoff based on the model's estimated sensitivity to a set of true variants.
The filter determination is not just a pass/fail process. For each variant, the tool evaluates which "tranche", or slice of the dataset, it falls into in terms of sensitivity to the truth set. Variants in tranches that fall below the specified truth sensitivity filter level have their FILTER field annotated with the corresponding tranche level. This results in a callset that is filtered to the desired level but retains the information necessary to increase sensitivity if needed.
Note that "filtered" here means that variants failing the requested tranche cutoff are marked as filtered in the output VCF; they are not discarded unless you request it with --exclude-filtered.
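Because failing records are tagged rather than removed, tallying the FILTER column of the output is a quick way to see how the callset was partitioned. The following is a minimal sketch using awk; `count_filters` is a hypothetical helper name, the demo records are made up, and the tranche label shown is only an illustrative placeholder for whatever label the tool writes.

```shell
#!/usr/bin/env bash
# Tally the FILTER column (column 7) of an uncompressed VCF.
# count_filters is a hypothetical helper, not part of GATK.
count_filters() {
    awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Tiny made-up VCF for demonstration; the tranche label is illustrative only.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tVQSLOD=5.1\n' >> "$demo_vcf"
printf 'chr1\t200\t.\tC\tT\t30\tVQSRTrancheSNP99.00to99.90\tVQSLOD=-1.2\n' >> "$demo_vcf"
printf 'chr1\t300\t.\tG\tA\t60\tPASS\tVQSLOD=7.8\n' >> "$demo_vcf"

# Prints one line per FILTER value with its record count.
count_filters "$demo_vcf"
```

The helper reads plain text, so a compressed output file would need to be decompressed first (for example with `gzip -dc`).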
Summary of the VQSR procedure
The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. These probabilities can then be used to filter the variants with a greater level of accuracy and flexibility than can typically be achieved by traditional hard filtering (filtering on individual annotation value thresholds). The first pass consists of building a model that describes how variant annotation values co-vary with the truthfulness of variant calls in a training set, and then scoring all input variants according to the model. The second pass simply consists of specifying a target sensitivity value (which corresponds to an empirical VQSLOD cutoff) and applying filters to each variant call according to its ranking. The result is a VCF file in which variants have been assigned a score and filter status.
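To make the link between a target sensitivity and a VQSLOD cutoff concrete, here is a simplified sketch. It assumes a plain list of VQSLOD scores for truth-set variants and picks the lowest score that still retains the requested percentage of them. This illustrates the idea only; it is not the tool's implementation (ApplyVQSR takes its cutoff from the tranches file), and `lod_cutoff_for_sensitivity` and the scores are hypothetical.

```shell
#!/usr/bin/env bash
# lod_cutoff_for_sensitivity <target_percent> <file with one truth-site VQSLOD per line>
# Prints the score of the k-th best truth site, where k = ceil(target% of all truth sites).
# Hypothetical helper for illustration; not part of GATK.
lod_cutoff_for_sensitivity() {
    sort -rn "$2" | awk -v ts="$1" '
        { score[NR] = $1 }                  # scores arrive sorted from high to low
        END {
            need = ts * NR / 100            # truth sites that must be retained
            k = int(need); if (k < need) k++    # round up
            if (k < 1) k = 1
            print score[k]
        }'
}

# Ten made-up truth-site scores.
truth_scores=$(mktemp)
printf '%s\n' 9.5 8.1 7.0 6.2 5.5 4.0 2.3 0.4 -1.5 -6.0 > "$truth_scores"

lod_cutoff_for_sensitivity 90 "$truth_scores"   # keeps 9 of 10 truth sites
lod_cutoff_for_sensitivity 99 "$truth_scores"   # keeps all 10 truth sites
```

Raising the target sensitivity can only lower the cutoff (or leave it unchanged), which is why higher sensitivity admits more low-scoring calls.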
VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.
Inputs
- The raw input variants to be filtered.
- The recalibration table file that was generated by the VariantRecalibrator tool.
- The tranches file that was generated by the VariantRecalibrator tool.
Output
- A recalibrated VCF file in which each variant of the requested type is annotated with its VQSLOD and marked as filtered if the score is below the desired quality level.
Applying recalibration/filtering to SNPs
gatk ApplyVQSR \
    -R Homo_sapiens_assembly38.fasta \
    -V input.vcf.gz \
    -O output.vcf.gz \
    --truth-sensitivity-filter-level 99.0 \
    --tranches-file output.tranches \
    --recal-file output.recal \
    -mode SNP
Allele-specific version of the SNP filtering (beta)
gatk ApplyVQSR \
    -R Homo_sapiens_assembly38.fasta \
    -V input.vcf.gz \
    -O output.vcf.gz \
    -AS \
    --truth-sensitivity-filter-level 99.0 \
    --tranches-file output.AS.tranches \
    --recal-file output.AS.recal \
    -mode SNP
Note that the tranches and recalibration files must have been produced by an allele-specific run of VariantRecalibrator. Also note that the AS_culprit, AS_FilterStatus, and AS_VQSLOD fields will have placeholder values (NA or NaN) for alleles of a type that has not yet been processed by ApplyVQSR. The spanning deletion allele (*) will not be recalibrated because it represents missing data. Its VQSLOD will remain NaN, and its culprit and FilterStatus will be NA.
Each allele will be annotated by its corresponding entry in the AS_FilterStatus INFO field annotation. Allele-specific VQSLOD and culprit are also carried through from VariantRecalibrator, and stored in the AS_VQSLOD and AS_culprit INFO fields, respectively. The site-level filter is set to the most lenient of any of the allele filters. That is, if one allele passes, the whole site will be PASS. If no alleles pass, the site-level filter will be set to the lowest sensitivity tranche among all the alleles.
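The site-level rule described above can be sketched as a small shell function. This is an illustration only, not the tool's code: `site_filter` is a hypothetical helper, and it assumes each failing allele status is a label ending in `<lower>to<upper>` (for example `VQSRTrancheSNP99.00to99.90`), which is an assumption about the label format. NA placeholders are skipped.

```shell
#!/usr/bin/env bash
# site_filter <allele_status>...
# Prints PASS if any allele passes; otherwise prints the allele label with the
# lowest tranche lower bound (the most lenient one). Hypothetical helper.
site_filter() {
    printf '%s\n' "$@" | awk '
        $0 == "NA"   { next }               # placeholder, e.g. spanning deletion
        $0 == "PASS" { pass = 1 }
        match($0, /[0-9.]+to[0-9.]+$/) {    # assumed "<lower>to<upper>" suffix
            split(substr($0, RSTART, RLENGTH), b, "to")
            if (best == "" || b[1] + 0 < lo) { lo = b[1] + 0; best = $0 }
        }
        END {
            if (pass) print "PASS"
            else if (best != "") print best
            else print "NA"
        }'
}

# One passing allele makes the whole site PASS.
site_filter PASS VQSRTrancheSNP99.90to100.00
# No passing allele: the lowest sensitivity tranche wins.
site_filter VQSRTrancheSNP99.90to100.00 VQSRTrancheSNP99.00to99.90 NA
```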
- The tranche values used in the examples above are only meant to be a general example. You should determine the level of sensitivity that is appropriate for your specific project. Remember that higher sensitivity (more power to detect variants, yay!) comes at the cost of specificity (more false positives, boo!). You have to choose at what point you want to set the tradeoff.
- In order to create the tranche reporting plots (which are only generated for SNPs, not indels!) the Rscript executable needs to be in your environment PATH (this is the scripting version of R, not the interactive version).
ApplyVQSR specific arguments
|Argument name(s)||Default value||Summary|
|--output / -O||null||The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value|
|--recal-file||null||The input recal file used by ApplyVQSR|
|--variant / -V||||One or more VCF files containing variants|
|Optional Tool Arguments|
|--arguments_file||||read one or more arguments files and add them to the command line|
|--cloud-index-prefetch-buffer / -CIPB||-1||Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.|
|--cloud-prefetch-buffer / -CPB||40||Size of the cloud-only prefetch buffer (in MB; 0 to disable).|
|--disable-bam-index-caching / -DBIC||false||If true, don't cache BAM indexes. This reduces memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.|
|--exclude-filtered||false||Don't output filtered loci after applying the recalibration|
|--gcs-max-retries / -gcs-retries||20||If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection|
|--help / -h||false||display the help message|
|--ignore-all-filters||false||If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.|
|--ignore-filter||||If specified, the recalibration will be applied to variants marked as filtered by the specified filter name in the input VCF file|
|--interval-merging-rule / -imr||ALL||Interval merging rule for abutting intervals|
|--intervals / -L||||One or more genomic intervals over which to operate|
|--mode / -mode||SNP||Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.|
|--sites-only-vcf-output||false||If true, don't emit genotype fields when writing vcf file output.|
|--tranches-file||null||The input tranches file describing where to cut the data|
|--truth-sensitivity-filter-level / -ts-filter-level||null||The truth sensitivity level at which to start filtering|
|--use-allele-specific-annotations / -AS||false||If specified, the tool will attempt to apply a filter to each allele based on the input tranches and allele-specific .recal file.|
|--version||false||display the version number for this tool|
|Optional Common Arguments|
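|--add-output-sam-program-record / -add-output-sam-program-record||true||If true, adds a PG tag to created SAM/BAM/CRAM files.|
|--add-output-vcf-command-line / -add-output-vcf-command-line||true||If true, adds a command line header line to created VCF files.|
|--create-output-bam-index / -OBI||true||If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.|
|--create-output-bam-md5 / -OBM||false||If true, create an MD5 digest for any BAM/SAM/CRAM file created|
|--create-output-variant-index / -OVI||true||If true, create a VCF index when writing a coordinate-sorted VCF file.|
|--create-output-variant-md5 / -OVM||false||If true, create an MD5 digest for any VCF file created.|
|--disable-read-filter / -DF||||Read filters to be disabled before analysis|
|--exclude-intervals / -XL||||One or more genomic intervals to exclude from processing|
|--gatk-config-file||null||A configuration file to use with the GATK.|
|--input / -I||||BAM/SAM/CRAM file containing reads|
|--interval-exclusion-padding / -ixp||0||Amount of padding (in bp) to add to each interval you are excluding.|
|--interval-padding / -ip||0||Amount of padding (in bp) to add to each interval you are including.|
|--interval-set-rule / -isr||UNION||Set merging approach to use for combining interval inputs|
|--lenient / -LE||false||Lenient processing of VCF files|
|--QUIET||false||Whether to suppress job-summary info on System.err.|
|--read-filter / -RF||||Read filters to be applied before analysis|
|--seconds-between-progress-updates / -seconds-between-progress-updates||10.0||Output traversal statistics every time this many seconds elapse|
|--sequence-dictionary / -sequence-dictionary||null||Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.|
|--use-jdk-deflater / -jdk-deflater||false||Whether to use the JdkDeflater (as opposed to IntelDeflater)|
|--use-jdk-inflater / -jdk-inflater||false||Whether to use the JdkInflater (as opposed to IntelInflater)|
|--verbosity / -verbosity||INFO||Control verbosity of logging.|
|--lod-score-cutoff||null||The VQSLOD score below which to start filtering|
|--showHidden / -showHidden||false||display hidden arguments|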
--add-output-sam-program-record / -add-output-sam-program-record
If true, adds a PG tag to created SAM/BAM/CRAM files.
--add-output-vcf-command-line / -add-output-vcf-command-line
If true, adds a command line header line to created VCF files.
--arguments_file / NA
read one or more arguments files and add them to the command line
--cloud-index-prefetch-buffer / -CIPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
int -1 [ [ -∞ ∞ ] ]
--cloud-prefetch-buffer / -CPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable).
int 40 [ [ -∞ ∞ ] ]
--create-output-bam-index / -OBI
If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
--create-output-bam-md5 / -OBM
If true, create an MD5 digest for any BAM/SAM/CRAM file created
--create-output-variant-index / -OVI
If true, create a VCF index when writing a coordinate-sorted VCF file.
--create-output-variant-md5 / -OVM
If true, create an MD5 digest for any VCF file created.
--disable-bam-index-caching / -DBIC
If true, don't cache BAM indexes. This reduces memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
--disable-read-filter / -DF
Read filters to be disabled before analysis
--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation
--disable-tool-default-read-filters / -disable-tool-default-read-filters
--exclude-filtered / NA
Don't output filtered loci after applying the recalibration
--exclude-intervals / -XL
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).
--gatk-config-file / NA
A configuration file to use with the GATK.
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--help / -h
display the help message
--ignore-all-filters / NA
If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.
--ignore-filter / NA
If specified, the recalibration will be applied to variants marked as filtered by the specified filter name in the input VCF file
For this to work properly, the --ignore-filter argument should also be applied to the VariantRecalibrator command.
--input / -I
BAM/SAM/CRAM file containing reads
--interval-exclusion-padding / -ixp
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not actually overlap) into a single continuous interval. However you can change this behavior if you want them to be treated as separate intervals instead.
--interval-padding / -ip
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
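The padding arithmetic described for -L (and likewise for -XL) can be checked with a small sketch. `pad_interval` is a hypothetical helper, not a GATK command; it assumes a simple `contig:pos` or `contig:start-end` interval with no colon in the contig name, clamps the start at 1, and does not clamp the end to the contig length because it has no sequence dictionary to consult.

```shell
#!/usr/bin/env bash
# pad_interval <interval> <padding_bp>
# Expands a "contig:pos" or "contig:start-end" interval by the padding on both sides.
# Hypothetical helper for illustration only.
pad_interval() {
    awk -v iv="$1" -v pad="$2" 'BEGIN {
        split(iv, parts, ":")
        contig = parts[1]
        m = split(parts[2], pos, "-")
        start = pos[1] + 0
        stop = (m > 1) ? pos[2] + 0 : start   # single position: start == stop
        start -= pad
        if (start < 1) start = 1              # coordinates are 1-based
        stop += pad
        print contig ":" start "-" stop
    }'
}

pad_interval 1:100 20       # the example from the text
pad_interval 1:100-200 20   # a range is widened on both ends
```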
--interval-set-rule / -isr
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will always be merged using UNION). Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
- UNION: Take the union of all intervals
- INTERSECTION: Take the intersection of intervals (the subset that overlaps all intervals specified)
--intervals / -L
One or more genomic intervals over which to operate
--lenient / -LE
Lenient processing of VCF files
--lod-score-cutoff / NA
The VQSLOD score below which to start filtering
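To preview which records a given cutoff would affect, the VQSLOD value can be read from the INFO column and compared against the threshold. The sketch below only reports the comparison and does not modify the VCF; `flag_by_lod`, the demo records, and the `below`/`ok` labels are hypothetical and are not the tool's own output.

```shell
#!/usr/bin/env bash
# flag_by_lod <cutoff> <uncompressed VCF>
# Prints CHROM, POS and "below" or "ok" for each record that carries a VQSLOD.
# Hypothetical helper for illustration; not part of GATK.
flag_by_lod() {
    awk -F'\t' -v cutoff="$1" '
        /^#/ { next }                       # skip header lines
        {
            n = split($8, kv, ";")          # INFO is column 8, key=value pairs
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^VQSLOD=/) {
                    lod = substr(kv[i], 8) + 0
                    print $1 "\t" $2 "\t" (lod < cutoff + 0 ? "below" : "ok")
                }
        }' "$2"
}

# Tiny made-up VCF for demonstration.
lod_demo=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$lod_demo"
printf 'chr1\t100\t.\tA\tG\t50\t.\tVQSLOD=5.1;culprit=QD\n' >> "$lod_demo"
printf 'chr1\t200\t.\tC\tT\t30\t.\tVQSLOD=-1.2;culprit=FS\n' >> "$lod_demo"
printf 'chr1\t300\t.\tG\tA\t60\t.\tVQSLOD=7.8;culprit=MQ\n' >> "$lod_demo"

flag_by_lod 0 "$lod_demo"
```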
--mode / -mode
Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.
The --mode argument is an enumerated type (Mode), which can have one of the following values: SNP, INDEL, BOTH.
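Since SNP mode emits indels untouched, one way to filter both variant types with separate models is to run the tool twice, feeding the output of the first run into the second. The sketch below only prints the two commands (a dry run) unless DRY_RUN=0 is set. The intermediate and indel file names (`snps_filtered.vcf.gz`, `indel.tranches`, `indel.recal`) are hypothetical, the 99.0 level is just the example value used above, and the wrapper function is not part of GATK.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually invoke gatk (requires GATK on PATH).
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

apply_vqsr_two_pass() {
    # Pass 1: filter SNPs; indels are written through unchanged.
    run gatk ApplyVQSR \
        -R Homo_sapiens_assembly38.fasta \
        -V input.vcf.gz \
        -O snps_filtered.vcf.gz \
        --truth-sensitivity-filter-level 99.0 \
        --tranches-file output.tranches \
        --recal-file output.recal \
        -mode SNP

    # Pass 2: filter indels in the SNP-filtered file, using files from an
    # INDEL-mode VariantRecalibrator run (hypothetical file names).
    run gatk ApplyVQSR \
        -R Homo_sapiens_assembly38.fasta \
        -V snps_filtered.vcf.gz \
        -O output.vcf.gz \
        --truth-sensitivity-filter-level 99.0 \
        --tranches-file indel.tranches \
        --recal-file indel.recal \
        -mode INDEL
}

apply_vqsr_two_pass
```

The appropriate sensitivity level may differ between SNPs and indels, so the two values need not match.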
--output / -O
The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value
R String null
--QUIET / NA
Whether to suppress job-summary info on System.err.
--read-filter / -RF
Read filters to be applied before analysis
--read-index / -read-index
--read-validation-stringency / -VS
The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values: STRICT, LENIENT, SILENT.
--recal-file / NA
The input recal file used by ApplyVQSR
R FeatureInput[VariantContext] null
--reference / -R
--seconds-between-progress-updates / -seconds-between-progress-updates
Output traversal statistics every time this many seconds elapse
double 10.0 [ [ -∞ ∞ ] ]
--sequence-dictionary / -sequence-dictionary
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
--showHidden / -showHidden
display hidden arguments
--sites-only-vcf-output
If true, don't emit genotype fields when writing vcf file output.
--TMP_DIR / NA
--tranches-file / NA
The input tranches file describing where to cut the data
--truth-sensitivity-filter-level / -ts-filter-level
The truth sensitivity level at which to start filtering
--use-allele-specific-annotations / -AS
If specified, the tool will attempt to apply a filter to each allele based on the input tranches and allele-specific .recal file.
Filter the input file based on allele-specific recalibration data. See tool docs for site-level and allele-level filtering details. Requires a .recal file produced using an allele-specific run of VariantRecalibrator.
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
--variant / -V
One or more VCF files containing variants
R List[String] 
--verbosity / -verbosity
Control verbosity of logging.
--version / NA
display the version number for this tool
GATK version 22.214.171.124 built at 25-10-2019 02:10:54.