Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

JEXL filtering expressions Follow

5 comments

  • Avatar
    Yangyxt
    gatk SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select "isHet == 1" \
    -O output.vcf

    The command above failed and returned an empty vcf file. 

    Here is the screenshot of input vcf file

    0
    Comment actions Permalink
  • Avatar
    Yangyxt

    I also tried :

    gatk SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select "vc.getGenotype('<sampleID>').isHet()" \
    -O output.vcf

    Which failed either. 

    BTW, I'm trying on GATK 4.1.8.1

    0
    Comment actions Permalink
  • Avatar
    ProxyRaven

    Hi, I could not make it work with SelectVariants using JEXL expression. However I managed to filter my VCF files for a specified VAF (AD of the altered allele / DP) using VariantFiltration as shown below with GATK 4.1.9

    --genotype-filter-expression 'vc.getGenotype("samplename").getAD().1*1.0 / (vc.getGenotype("samplename").getDP()*1.0) > 0.40' \ --genotype-filter-name "VAF04" \

    It writes the filter name within the genotype infos and I used then:

    gatk SelectVariants \
    -V filtered.vcf \
    --set-filtered-gt-to-nocall \
    -O filtered_NoCall.vcf
    0
    Comment actions Permalink
  • Avatar
    Ruqian Lyu

    I believe I was having the same issue with Yangyxt, but I figured it out by changing the expression from 

    --genotype-filter-expression "isHet==1"
     
    to 
     
    --genotype-filter-expression isHet==1 
     
     
    So removing the double quotes when using `isHet`.
     
    Hope this helps.
     
    0
    Comment actions Permalink
  • Avatar
    everestial007

    a) GATK version used:

    GATK =gatk-package-4.2.5.0-local.jar

    $ java -jar ${GATK} --version
    The Genome Analysis Toolkit (GATK) v4.2.5.0
    HTSJDK Version: 2.24.1
    Picard Version: 2.25.4

    I just had to do some filtering of my own data and tested several methods (to see what the outcome would be). I am sharing those here for others and myself, if it is ever needed again.


    method to call the sites that are all homozygous Variants (1/1, 2/2 genotype)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6'
        # outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed

    in the above script if we want the site where at least 4 samples are homozygous ref or variants

        java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4'
        # outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.)

     at least 1 sample is homVar (1/1, 2/2 or 3/3)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0'

    all the samples are homRef (0/0 genotypes)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6'

    Note: 6 chr = 3 samples (when samples are diploid)

    sites that are all HomVar or nocalls (./.)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2'
        # outcome: no 0/0, 0/1
        # Note: samples with GT = ./. are not counted as called chr

    Useful after samples are sub-sampled from multisample vcfs

    sites where no genotypes were called for all the samples in the selected vcf
    or say, select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0'
        # outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected

    samples are either homRef (0/0) or no call (./.)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2'
        # outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0

    site where 6 chromosomes (exact) or 3 samples were called for diploid genome

        java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6'
        # Note of Caution: 6 chr = 3 samples (when samples are diploid)

    site where at least 3 samples (diploid) are called

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6'

    if I have 6-sample vcf

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12'
        # 12 haploid chr = 6 samples (diploid)
        # outcome: sites that were called in all samples, there will be no GT = ./.

    select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0'

    select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1'

    site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.)

        java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'


    where a site is only homRef for a particular sample and is only SNP

        java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf

    where a site is only homVar for a particular sample and is only SNP

        java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomVar()' -O test_variants.ms01e.Chr_2_3.vcf

    where a site is not homRef for a particular sample and is only SNP. This will also select empty sites

        java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select '! vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf


    NOTE: following genotype call methods won't work:

        # to get only heterozygous sites
        -select 'vc.getGenotype("ms01e").isHetVar() 

        # to get any variant sites
        -select 'vc.getGenotype("ms01e").isVar() 

        # won't work if getGenotype method is combined
        -select 'vc.getGenotype("ms01e").isHomVar() || vc.getGenotype("ms01e".isHomRef()'  


    So, I figured out the following methods where count helps with selecting particular site.

    where a site is a heterozygous variant site for a particular sample

        java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select 'vc.getHetCount() == 1'  -O test_variants.ms01e.vcf

        NOTE: 'vc.getGenotype("ms01e").isHetVar()' wont work so generated the above expression.

    where a site is a variant site (either homVar or hetVar) for a particular sample and is only SNP

        java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getHetCount() == 1 || vc.getHomVarCount() == 1'  -O test_variants.ms01e.vcf

        NOTE: 'vc.getGenotype("NA12878").isHomRef()' || vc.getGenotype("NA12878").isHomVar()' wont work together, so generated this above expression.

    Other updates to follow. Also check here https://www.biostars.org/p/255362/#255567
    Thanks,

     

    1
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk