Before using VariantFiltration, please read the entirety of the discussion in GitHub that describes VariantFiltration's unintuitive behavior when processing compound expressions.
This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.
First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (
./.) genotypes with SelectVariants so that downstream tools may discount them.
We use example variant record FORMAT fields from
trio.vcf to illustrate.
GT:AD:DP:GQ:PL 0/1:17,15:32:99:399,0,439 0/1:11,12:23:99:291,0,292 1/1:0,30:30:90:948,90,0
1. Annotate genotypes using VariantFiltration
If we want to filter heterozygous genotypes, we use VariantFiltration's
--genotype-filter-expression "isHet == 1" option. We can specify the annotation value for the tool to label the heterozygous genotypes with with the
--genotype-filter-name option. Here, this parameter's value is set to
gatk VariantFiltration \ -V trio.vcf \ -O trio_VF.vcf \ --genotype-filter-expression "isHet == 1" \ --genotype-filter-name "isHetFilter"After filtering, in the resulting
trio_VF.vcf, our example record adds an
FTfield and becomes:
GT:AD:DP:FT:GQ:PL 0/1:17,15:32:isHetFilter:99:399,0,439 0/1:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0We see that HET (
0/1) genotype calls get a
FTfield and other genotype calls get a
PASSin the genotype field.
The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:
We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.
2. Transform filtered genotypes to no call
Running SelectVariants with
--set-filtered-gt-to-nocall will further transform the flagged genotypes with a null genotype call. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.
gatk SelectVariants \ -V trio_VF.vcf \ --set-filtered-gt-to-nocall \ -O trioGGVCF_VF_SV.vcfThe result is that the
GTgenotypes of the
isHetFilteredgenotype records become null or no call (
./.) as follows.
GT:AD:DP:FT:GQ:PL ./.:17,15:32:isHetFilter:99:399,0,439 ./.:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0