JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants
Contents
- Basic structure of JEXL expressions for use with the GATK
- Evaluation on multiple annotations
- Filtering on sample/genotype-level properties
- Important caveats
- More complex JEXL magic
1. Basic structure of JEXL expressions for use with the GATK
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUAL
is a key: the name of the annotation we want to look at30.0
is a value: the threshold that we want to use to evaluate variant quality against>
is an operator: it determines which "side" of the threshold we want to select
The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
2. Evaluation on multiple annotations
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
where &&
is the logical "AND".
In the case where you want to select variants that have at least one of several conditions fulfilled, provide each expression separately:
"QD < 2.0" \ "ReadPosRankSum < -20.0" \ "FS > 200.0"
To be on the safe, do not use compound expressions with the logical "OR" || as a missing annotation will negate the entire expression.
3. Filtering on sample/genotype-level properties
You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0
), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).
4. Important caveats
Sensitivity to case and type
You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:
Case
Currently, VCF INFO field keys are case-sensitive. That means that if you have aQUAL
field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual
,qual
or whatever) in your JEXL expression.Type
The types (i.e. string, integer, non-integer, floating point or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g.QUAL < 50
), the system will throw a hissy fit (specifically, a Java exception, e.g. aNumberFormatException
for numerical type mismatches).
Complex queries
We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
5. More complex JEXL magic
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves.
Warning: the command line examples given below have not yet been updated for GATK4 syntax.
Introducing the VariantContext object
When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant
library included within the Picard tools source code, which GATK uses for convenience.
The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.
In a nutshell, the VariantContext is available through the vc
variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.
Using VariantContext directly
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
gatk SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.highfreqnovels.vcf -sn 01-0263
Using the VariantContext to evaluate boolean values
The classic way of evaluating a boolean goes like this:
gatk SelectVariants \ -R reference.fasta \ -V my.vcf \ -select 'DB'
But you can also use the VariantContext object like this:
gatk SelectVariants \ -R reference.fasta \ -V my.vcf \ -select 'vc.hasAttribute("DB")'
Using VariantContext to access annotations in multiallelic sites
The order of alleles in the VariantContext object is not guaranteed to be the same as in the VCF output, so accessing the AF by an index derived from a scrambled alleles array is dangerous. However! If we have the sample genotypes, there's a workaround:
gatk SelectVariants \ -R reference.fasta \ -V multiallelics.vcf \ -select 'vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf
The odd 1.0 is there because otherwise we're dividing two integers, which will always yield 0. The vc.hasGenotypes()
is extra error checking. This might be slow for large files, but we could use something like this if performance is a concern:
gatk SelectVariants \ -R reference.fasta \ -V multiallelics.vcf \ -select 'vc.isBiallelic() ? AF > 0.1 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf
Where hopefully the ternary expression shortcuts the extra vc
calls for all the biallelics.
Using JEXL to evaluate arrays
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
gatk SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select 'vc.getGenotype("NA12878").getAD().0 > 10'
If you would like to select sites where the alternate allele frequency is greater than 50%, you can use the following expression:
gatk SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select vc.getGenotype("NA12878").getAD().1 / vc.getGenotype("NA12878").getDP() > 0.50
5 comments
The command above failed and returned an empty vcf file.
Here is the screenshot of input vcf file
I also tried :
Which failed either.
BTW, I'm trying on GATK 4.1.8.1
Hi, I could not make it work with SelectVariants using JEXL expression. However I managed to filter my VCF files for a specified VAF (AD of the altered allele / DP) using VariantFiltration as shown below with GATK 4.1.9
It writes the filter name within the genotype infos and I used then:
I believe I was having the same issue with Yangyxt, but I figured it out by changing the expression from
a) GATK version used:
GATK =gatk-package-4.2.5.0-local.jar
$ java -jar ${GATK} --version
The Genome Analysis Toolkit (GATK) v4.2.5.0
HTSJDK Version: 2.24.1
Picard Version: 2.25.4
I just had to do some filtering of my own data and tested several methods (to see what the outcome would be). I am sharing those here for others and myself, if it is ever needed again.
method to call the sites that are all homozygous Variants (1/1, 2/2 genotype)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6'
# outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed
in the above script if we want the site where at least 4 samples are homozygous ref or variants
java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4'
# outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.)
at least 1 sample is homVar (1/1, 2/2 or 3/3)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0'
all the samples are homRef (0/0 genotypes)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6'
Note: 6 chr = 3 samples (when samples are diploid)
sites that are all HomVar or nocalls (./.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2'
# outcome: no 0/0, 0/1
# Note: samples with GT = ./. are not counted as called chr
Useful after samples are sub-sampled from multisample vcfs
sites where no genotypes were called for all the samples in the selected vcf
or say, select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0'
# outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected
samples are either homRef (0/0) or no call (./.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2'
# outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0
site where 6 chromosomes (exact) or 3 samples were called for diploid genome
java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6'
# Note of Caution: 6 chr = 3 samples (when samples are diploid)
site where at least 3 samples (diploid) are called
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6'
if I have 6-sample vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12'
# 12 haploid chr = 6 samples (diploid)
# outcome: sites that were called in all samples, there will be no GT = ./.
select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0'
select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1'
site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'
where a site is only homRef for a particular sample and is only SNP
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf
where a site is only homVar for a particular sample and is only SNP
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomVar()' -O test_variants.ms01e.Chr_2_3.vcf
where a site is not homRef for a particular sample and is only SNP. This will also select empty sites
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select '! vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf
NOTE: following genotype call methods won't work:
# to get only heterozygous sites
-select 'vc.getGenotype("ms01e").isHetVar()
# to get any variant sites
-select 'vc.getGenotype("ms01e").isVar()
# won't work if getGenotype method is combined
-select 'vc.getGenotype("ms01e").isHomVar() || vc.getGenotype("ms01e".isHomRef()'
So, I figured out the following methods where count helps with selecting particular site.
where a site is a heterozygous variant site for a particular sample
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select 'vc.getHetCount() == 1' -O test_variants.ms01e.vcf
NOTE: 'vc.getGenotype("ms01e").isHetVar()' wont work so generated the above expression.
where a site is a variant site (either homVar or hetVar) for a particular sample and is only SNP
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getHetCount() == 1 || vc.getHomVarCount() == 1' -O test_variants.ms01e.vcf
NOTE: 'vc.getGenotype("NA12878").isHomRef()' || vc.getGenotype("NA12878").isHomVar()' wont work together, so generated this above expression.
Other updates to follow. Also check here https://www.biostars.org/p/255362/#255567
Thanks,
Please sign in to leave a comment.