This article outlines two different approaches to site-level variant filtration. Site-level filtering uses INFO-field annotations to filter variant records.
In Section 1, we will outline the steps in Variant Quality Score Recalibration (VQSR). In Section 2, we will outline the steps in hard-filtering. See this article for in-depth descriptions of the different variant annotations.
The GATK Best Practices recommend filtering germline variant callsets with VQSR. For WDL script implementations of the pipelines the Broad Genomics Platform uses in production, i.e. reference implementations, see the links provided on the Germline Best Practices Workflow page to both the gatk-workflows WDL script repository and to Terra workspaces. Each includes example data as well as publicly available recommended human population resources. See this article on truth sets for training and this one about VQSR for more information.
Hard-filtering is useful when the data cannot support VQSR or when an analysis requires manual filtering. Additionally, hard-filtering allows for filtering on sample-level annotations, i.e. FORMAT field annotations, which this article does not cover. See this tutorial to filter on FORMAT field attributes and to change the genotypes of such filtered sample sites to NULL ./.
► GATK4 offers a deep learning method to filter germline variants that is applicable to single sample callsets. As of this writing, the CNN workflow is in experimental status (check here for an update). See this CNN notebook on Terra for an overview and initial benchmarking results, and see the gatk4-cnn-variant-filter repository for the WDL pipeline.
► For more complex variant filtering and annotation, see the Broad Hail.is framework at https://hail.is/index.html.
► After variant filtration, if downstream analyses require high-quality genotype calls, consider genotype refinement, e.g. filtering posterior-corrected GQ<20 genotypes. See this article for an overview.
1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR
This section outlines the VQSR filtering steps performed in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline. Note the workflow hard-filters on the ExcessHet annotation before VQSR filtering, with the expectation that the callset represents many samples.
[A] Hard-filter a large cohort callset on ExcessHet using VariantFiltration

ExcessHet filtering applies only to callsets with a large number of samples, e.g. hundreds of unrelated samples. Small cohorts should not trigger ExcessHet filtering, as values should remain small. Note that cohorts of consanguineous samples will show inflated ExcessHet; for such cohorts, it is possible to limit the annotation to founders by providing a pedigree file during variant calling.
gatk --java-options "-Xmx3g -Xms3g" VariantFiltration \
    -V cohort.vcf.gz \
    --filter-expression "ExcessHet > 54.69" \
    --filter-name ExcessHet \
    -O cohort_excesshet.vcf.gz
This produces a VCF callset where any record with ExcessHet greater than 54.69 is annotated with the ExcessHet label in the FILTER column. The phred-scaled threshold of 54.69 corresponds to a z-score of -4.5. If a record lacks the ExcessHet annotation, it will pass filters.
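As a sanity check on this threshold, the correspondence between a z-score and a phred-scaled value can be reproduced in a few lines of Python (an illustrative sketch, not part of any GATK tooling):

```python
import math

def phred_from_zscore(z):
    """Convert a standard-normal z-score to a phred-scaled value.

    The phred scale is -10 * log10(p), where p here is the lower-tail
    probability P(Z < z) of the standard normal distribution.
    """
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # lower-tail probability
    return -10.0 * math.log10(p)

# A z-score of -4.5 corresponds to the ExcessHet cutoff of ~54.69
print(round(phred_from_zscore(-4.5), 2))  # -> 54.69
```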
[B] Create a sites-only VCF with MakeSitesOnlyVcf

Site-level filtering requires only site-level annotations. We can speed up the modeling step by using a VCF that drops the sample-level columns.
gatk MakeSitesOnlyVcf \
    -I cohort_excesshet.vcf.gz \
    -O cohort_sitesonly.vcf.gz
This produces a VCF that retains only the first eight columns.
[C] Calculate VQSLOD tranches for indels using VariantRecalibrator

All of the population resource files are publicly available at gs://gcp-public-data--broad-references/hg38/v0. The parameters in this article reflect those in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline and are thus tuned for WGS samples. For recommendations specific to exome samples, for the reasons why SNPs and indels require different filtering, and for additional discussion of training sets and arguments, see this article. For example, the article states:
For filtering indels, most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.
gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator \
    -V cohort_sitesonly.vcf.gz \
    --trust-all-polymorphic \
    -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 \
    -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 \
    -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
    -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP \
    -mode INDEL \
    --max-gaussians 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -resource:axiomPoly,known=false,training=true,truth=false,prior=10:Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2:Homo_sapiens_assembly38.dbsnp138.vcf \
    -O cohort_indels.recal \
    --tranches-file cohort_indels.tranches
The --max-gaussians parameter sets the expected number of clusters in modeling. If a dataset yields fewer distinct clusters, as can happen with smaller datasets, the tool reports insufficient data with a "No data found" error message. In this case, try decrementing the --max-gaussians value.
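One way to automate that decrement is a retry loop around the tool invocation. The sketch below is hypothetical pipeline glue, not GATK functionality; build_command is an assumed helper that returns the full VariantRecalibrator command line for a given --max-gaussians value:

```python
import subprocess

def recalibrate_with_fallback(build_command, start_gaussians=4, min_gaussians=1):
    """Run VariantRecalibrator, retrying with fewer Gaussians on failure.

    build_command(g) must return the command line (a list of strings)
    with --max-gaussians set to g. Returns the value that succeeded.
    """
    for gaussians in range(start_gaussians, min_gaussians - 1, -1):
        result = subprocess.run(build_command(gaussians))
        if result.returncode == 0:
            return gaussians
    raise RuntimeError(
        "modeling failed even at --max-gaussians %d" % min_gaussians)
```

In a managed workflow, the same effect is usually achieved by resubmitting the task with a lower value rather than looping in-process.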
[D] Calculate VQSLOD tranches for SNPs using VariantRecalibrator
gatk --java-options "-Xmx3g -Xms3g" VariantRecalibrator \
    -V cohort_sitesonly.vcf.gz \
    --trust-all-polymorphic \
    -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 \
    -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 \
    -tranche 97.0 -tranche 90.0 \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
    -mode SNP \
    --max-gaussians 6 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15:hapmap_3.3.hg38.vcf.gz \
    -resource:omni,known=false,training=true,truth=true,prior=12:1000G_omni2.5.hg38.vcf.gz \
    -resource:1000G,known=false,training=true,truth=false,prior=10:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    -resource:dbsnp,known=true,training=false,truth=false,prior=7:Homo_sapiens_assembly38.dbsnp138.vcf \
    -O cohort_snps.recal \
    --tranches-file cohort_snps.tranches
Each step, C and D, produces a .recal recalibration table and a .tranches tranches table. The filtering step, ApplyVQSR, uses both.
- To additionally produce the optional tranche plot, specify the --rscript-file parameter. See the VariantRecalibrator tool documentation for details and this discussion thread for an example plot.
- For allele-specific recalibration of an allele-specific callset, a beta feature as of this writing, add the -AS parameter.
☞ 1.1 How can I parallelize VQSR?
For cohorts with more than 10,000 WGS samples, it is possible to break down the analysis across genomic regions for parallel processing. The 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline does so first by increasing --java-options to "-Xmx100g -Xms100g" and second by adding the following parameters to the command, to subsample variants and to produce a file of the VQSR model:
--sample-every-Nth-variant 10 \
--output-model ${model_report_filename} \
The pipeline then applies the resulting model to each genomic interval, using the same parameters as above with two additions: it provides the resulting model report to VariantRecalibrator with --input-model and specifies the --output-tranches-for-scatter flag. The pipeline then collates the resulting per-interval tranches with GatherTranches. Refer to the pipeline script for implementation details.
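The scatter itself amounts to dividing the genome into shards of roughly equal size. Below is a simplified sketch of that bookkeeping, assigning whole contigs greedily; the production pipeline scatters on finer-grained interval lists:

```python
def shard_contigs(contig_lengths, n_shards):
    """Greedily assign whole contigs to shards, largest first, so that
    each shard covers a roughly equal share of the genome.

    contig_lengths: list of (contig_name, length) pairs.
    """
    shards = [[] for _ in range(n_shards)]
    sizes = [0] * n_shards
    for contig, length in sorted(contig_lengths, key=lambda c: -c[1]):
        smallest = sizes.index(min(sizes))  # least-loaded shard so far
        shards[smallest].append(contig)
        sizes[smallest] += length
    return shards

# Four contigs split across two shards of equal total length
print(shard_contigs([("chr1", 100), ("chr2", 90), ("chr3", 60), ("chr4", 50)], 2))
```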
Successively apply the indel and SNP recalibrations to the full callset that has already undergone ExcessHet filtering.
[E] Filter indels on VQSLOD using ApplyVQSR
gatk --java-options "-Xmx5g -Xms5g" \
    ApplyVQSR \
    -V cohort_excesshet.vcf.gz \
    --recal-file cohort_indels.recal \
    --tranches-file cohort_indels.tranches \
    --truth-sensitivity-filter-level 99.7 \
    --create-output-variant-index true \
    -mode INDEL \
    -O indel.recalibrated.vcf.gz
This produces an indel-filtered callset. At this point, SNP-type variants remain unfiltered.
[F] Filter SNPs on VQSLOD using ApplyVQSR
gatk --java-options "-Xmx5g -Xms5g" \
    ApplyVQSR \
    -V indel.recalibrated.vcf.gz \
    --recal-file ${snps_recalibration} \
    --tranches-file ${snps_tranches} \
    --truth-sensitivity-filter-level 99.7 \
    --create-output-variant-index true \
    -mode SNP \
    -O snp.recalibrated.vcf.gz
This produces a SNP-filtered callset. Given the indel-filtered callset, this results in the final filtered callset.
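Conceptually, --truth-sensitivity-filter-level selects the VQSLOD cutoff that would retain the requested fraction of truth-set variants. The following sketch illustrates the idea only; it is not GATK's implementation:

```python
import math

def vqslod_cutoff(truth_vqslods, sensitivity):
    """Return the VQSLOD threshold retaining the requested fraction of
    truth-set variants; variants scoring at or above it pass.

    sensitivity: fraction between 0 and 1, e.g. 0.997 for a 99.7 tranche.
    """
    ranked = sorted(truth_vqslods, reverse=True)
    keep = max(1, math.ceil(sensitivity * len(ranked)))
    return ranked[keep - 1]

# Keeping 90% of ten truth variants means passing the top nine scores
print(vqslod_cutoff([10, 8, 6, 4, 2, 0, -2, -4, -6, -8], 0.9))  # -> -6
```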
2. Hard filter a cohort callset with VariantFiltration
This section of the tutorial provides generic hard-filtering thresholds and example commands for site-level manual filtering. A typical scenario requiring manual filtering is a small cohort callset, e.g. fewer than thirty exomes.
Researchers are expected to fine-tune hard-filtering thresholds for their data. To gauge the relative informativeness of specific variant annotations, the GATK hands-on hard-filtering workshop tutorial demonstrates how to plot distributions of annotation values for variant calls stratified against a truth set.
As with VQSR, hard-filter SNPs and indels separately. As of this writing, SelectVariants subsets SNP-only records, indel-only records or mixed-type, i.e. SNP and indel alternate alleles in the same record, separately. Therefore, when subsetting to SNP-only or indel-only records, mixed-type records are excluded. See this GitHub ticket for the status of a feature request to apply VariantFiltration directly on types of variants.
To avoid the loss of mixed-type variants, break up the multiallelic records into biallelic records before proceeding with the following subsetting. Alternatively, to process mixed-type variants with indel filtering thresholds, similar to VQSR, add -select-type MIXED to the second command [B].
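The SNP/indel/mixed distinction comes down to comparing REF and ALT allele lengths. A simplified illustration follows (it ignores symbolic alleles, spanning deletions and other edge cases the tools handle):

```python
def variant_type(ref, alts):
    """Classify a VCF record by its alternate alleles (simplified)."""
    kinds = set()
    for alt in alts:
        if len(ref) == len(alt) == 1:
            kinds.add("SNP")
        elif len(ref) != len(alt):
            kinds.add("INDEL")
        else:
            kinds.add("MNP")  # multi-nucleotide polymorphism
    return kinds.pop() if len(kinds) == 1 else "MIXED"

print(variant_type("A", ["G"]))        # -> SNP
print(variant_type("A", ["AT"]))       # -> INDEL
print(variant_type("A", ["G", "AT"]))  # -> MIXED
```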
[A] Subset to SNPs-only callset with SelectVariants
gatk SelectVariants \
    -V cohort.vcf.gz \
    -select-type SNP \
    -O snps.vcf.gz
This produces a VCF with records with SNP-type variants only.
[B] Subset to indels-only callset with SelectVariants
gatk SelectVariants \
    -V cohort.vcf.gz \
    -select-type INDEL \
    -O indels.vcf.gz
This produces a VCF with records with indel-type variants only.
[C] Hard-filter SNPs on multiple expressions using VariantFiltration
The GATK does not recommend compound filtering expressions, e.g. those using the logical || "OR". For such expressions, if a record is null for or missing a particular annotation in the expression, the tool negates the entire compound expression and thus automatically passes the variant record, even if it fails one of the component expressions. See this issue ticket for details.

Provide each expression separately with the -filter parameter, followed by its --filter-name. The tool evaluates each expression independently. Here we show basic filtering thresholds researchers may find useful as a starting point.
gatk VariantFiltration \
    -V snps.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "SOR > 3.0" --filter-name "SOR3" \
    -filter "FS > 60.0" --filter-name "FS60" \
    -filter "MQ < 40.0" --filter-name "MQ40" \
    -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O snps_filtered.vcf.gz
This produces a VCF with the same variant records, now annotated with filter status. Specifically, if a record passes all the filters, it receives a PASS label in the FILTER column. A record that fails a filter receives the failing filter's name in the FILTER column, e.g. SOR3. If a record fails multiple filters, each failing filter name appears in the FILTER column separated by semicolons, e.g. MQRankSum-12.5;ReadPosRankSum-8.
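These semantics, including the pass-through of records missing an annotation, can be sketched in a few lines (illustrative Python, not the tool's implementation):

```python
def apply_filters(info, filters):
    """Evaluate each filter independently and build the FILTER field.

    info: dict of site-level annotations for one record.
    filters: list of (name, annotation_key, fails) where fails(value)
    returns True when the record FAILS that filter. A record missing an
    annotation passes that filter, mirroring VariantFiltration.
    """
    failed = [name for name, key, fails in filters
              if key in info and fails(info[key])]
    return ";".join(failed) if failed else "PASS"

snp_filters = [
    ("QD2", "QD", lambda v: v < 2.0),
    ("FS60", "FS", lambda v: v > 60.0),
    ("MQ40", "MQ", lambda v: v < 40.0),
]

print(apply_filters({"QD": 1.5, "FS": 10.0, "MQ": 30.0}, snp_filters))  # -> QD2;MQ40
print(apply_filters({"QD": 20.0}, snp_filters))                         # -> PASS
```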
[D] Similarly, hard-filter indels on multiple expressions using VariantFiltration
gatk VariantFiltration \
    -V indels.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "FS > 200.0" --filter-name "FS200" \
    -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    -O indels_filtered.vcf.gz
This produces a VCF with the same variant records annotated with filter names for failing records. At this point, consider merging the separate callsets back together. A few comments follow.
- RankSum annotations can only be calculated for REF/ALT heterozygous sites and will therefore be absent from records without read counts supporting heterozygous genotypes.
- By default, GATK HaplotypeCaller and GenotypeGVCFs do not emit variants with QUAL < 10. The --standard-min-confidence-threshold-for-calling (-stand-call-conf) parameter adjusts this threshold. GATK recommends filtering variants with QUAL less than 30. The lower default QUAL threshold of the callers allows for more negative training data in VQSR filtering.
- When providing filtering thresholds, the tool expects the value to match the type specified in the ##INFO lines of the callset. For example, an Integer type is a whole number without decimals, e.g. 0, and a Float type is a number with decimals, e.g. 0.0. If the types mismatch, the tool gives a java.lang.NumberFormatException error.
- If a filter expression is misspelled, the tool does not give a warning, so be sure to review filter expressions carefully for correctness.
3. Evaluate the filtered callset
Filtering is about balancing sensitivity and precision for research aims. For example, genome-wide association studies can afford to maximize sensitivity over precision such that there are more false positives in the callset. Conversely, downstream analyses that require high precision, e.g. those that cannot tolerate false positive calls because validating variants is expensive, maximize precision over sensitivity such that the callset loses true positives.
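In terms of counts against a truth set, the two quantities being traded off are (standard definitions, shown here for concreteness):

```python
def precision_and_sensitivity(tp, fp, fn):
    """Precision: fraction of emitted calls that are true.
    Sensitivity (recall): fraction of true variants recovered."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return precision, sensitivity

# 90 true calls, 10 false positives, 30 missed variants
print(precision_and_sensitivity(90, 10, 30))  # -> (0.9, 0.75)
```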
Two tools enable site-level evaluation: CollectVariantCallingMetrics and VariantEval. Another tool, GenotypeConcordance, measures sample-level genotype concordance and is not covered here. For an overview of all three tools, see this article.
Compare callset against a known population callset using CollectVariantCallingMetrics
gatk CollectVariantCallingMetrics \
    -I filtered.vcf.gz \
    --DBSNP Homo_sapiens_assembly38.dbsnp138.vcf \
    -SD Homo_sapiens_assembly38.dict \
    -O metrics
This produces detailed and summary metrics report files. The summary metrics provide cohort-level variant metrics, while the detailed metrics segment variant metrics for each sample in the callset. The detailed metrics give the same metrics as the summary metrics per sample, plus several additional metrics. These are explained at https://broadinstitute.github.io/picard/picard-metric-definitions.html.
Compare callset against a known population callset using VariantEval

As of this writing, VariantEval is in beta status in GATK v4.1, so we provide an example GATK3 command, where the tool is in production status. GATK3 Dockers are available at https://hub.docker.com/r/broadinstitute/gatk3.
java -jar gatk3.jar \
    -T VariantEval \
    -R Homo_sapiens_assembly38.fasta \
    -eval cohort.vcf.gz \
    -D Homo_sapiens_assembly38.dbsnp138.vcf \
    -noEV \
    -EV CompOverlap -EV IndelSummary -EV TiTvVariantEvaluator \
    -EV CountVariants -EV MultiallelicSummary \
    -o cohortEval.txt
This produces a file containing a table for each of the evaluation modules, e.g. CompOverlap.
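For instance, the TiTvVariantEvaluator module reports the transition/transversion ratio, whose core computation is easy to illustrate (a sketch; the expected ratio, roughly 2.0-2.1 genome-wide for human, depends on the territory evaluated):

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps):
    """Transition/transversion ratio for (REF, ALT) SNP pairs. A ratio
    well below the expected value suggests excess false positives."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv

print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]))  # -> 1.0
```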
Please note the GA4GH (Global Alliance for Genomics and Health) recommends using hap.py for stratified variant evaluations (1, 2). One approach using hap.py wraps the vcfeval module of RTG-Tools. The module accounts for differences in variant representation by matching variants mapped back to the reference.
9 comments
Very useful! I am wondering, if the data is whole genome sequencing, is it necessary to add DP < min || DP > 2.5 times average depth in the hard-filter step?
Looking forward to your reply.
I cannot view the files in gs://gcp-public-data--broad-references/hg38/v0. It seems we need Storage Object Viewer permission.
Hi
I have had the same problem as Min Ou with downloading the data, but I was able to find the files at https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0;tab=objects?prefix= .
Unfortunately I still cannot run VariantRecalibrator with the suggested parameters, as the relevant INFO fields are missing, and since the individual-level information is not included, they cannot be added by hand either. Therefore I get the error:
A USER ERROR has occurred: Bad input: Values for FS annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.
Is there a way to get the missing INFO fields for the resource datasets?
The syntax for specifying argument tags in VariantRecalibrator has changed. I came across a "No argument value found for tagged argument" error using GATK v4.2.0.0. It worked when I changed the parameters like this: --resource:hapmap,known=false,training=true,truth=true,prior=15.0 /trainee/ref/hapmap_3.3.hg38.vcf.
Hi, I am performing a whole exome sequencing study and I am wondering if I could use some hard filter steps like QD<2, DP<3 & GQ<20 after performing the VQSR filter or if it is not recommended using both together? My study is in germline, and I have a small sample size (around 70).
As mentioned by @stl23, example commands as currently given above require updating to account for syntax change, see https://gatk.broadinstitute.org/hc/en-us/articles/360035532192
Dear GATK development team,
I am working on some fairly rare/neglected parasites, of which we have hundreds of samples sequenced at a 25X coverage (some samples up to 50X) across the whole genome. I'd like to use VQSR filtering and incorporate it into our Nextflow pipeline, however there are no "True" datasets out there. I was wondering if I could create one myself, by selecting a subset of the best quality files and conducting joint genotyping and use the resulting cohort VCF as a True dataset to correct the others. Would that be possible?
All my best,
Max
Subject: Issue with undefined variables in VariantFiltration tool
Dear GATK Community,
I am encountering warnings about undefined variables when using the VariantFiltration tool in GATK version gatk4-4.5.0.0-0. Specifically, I am applying filters based on the `MQRankSum` and `ReadPosRankSum` annotations in my VCF file. Despite confirming the presence of these annotations and modifying the JEXL expressions accordingly, the warnings persist.
Key details:
- GATK version: gatk4-4.5.0.0-0
- Command used: VariantFiltration with modified JEXL expressions:
- `MQRankSum < -12.5` for MQRankSum filter
- `ReadPosRankSum < -8.0` for ReadPosRankSum filter
- Issue: Warnings about undefined variables for `MQRankSum` and `ReadPosRankSum`
"15:14:50.582 WARN JexlEngine - ![0,14]: 'ReadPosRankSum < -8.0;' undefined variable ReadPosRankSum
15:14:50.582 WARN JexlEngine - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum
"
I kindly request your guidance in resolving this issue. If there are known bugs or compatibility concerns related to these annotations in the VariantFiltration tool of GATK version gatk4-4.5.0.0-0, please advise.
Thank you for your valuable support. I look forward to your response.
Sincerely,
Afridi
Hello!
I have a question about the application of CollectVariantCallingMetrics. Should I use it on the separate files generated by hard filtering (indels and snps), or on my cohort file? The result shown here is divided into indels and snps: https://github.com/broadgsa/gatk/blob/master/doc_archive/tutorials/(howto)_Evaluate_a_callset_with_CollectVariantCallingMetrics.md
I'm confused. Thank you.