Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

SelectVariants JEXL expression works in somatic matched pair but not germline cohort

0

8 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Mark Godek, are you running these commands any differently, perhaps the cohort command through bash? Also, can you confirm that those annotations are present in both files? It looks like this issue may be caused by your JEXL expression not being correctly passed in your GATK command. 

    Here is some documentation about JEXL with GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035891011-JEXL-filtering-expressions

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    I'm following the GATK best practices for somatic and germline SNP discovery.

    For cohort command through bash, do you mean using gatk GenotypeGVCFs to call variants of a cohort from a genomics DB?

    Here are two variants from GNT923.novel.somatic.annotated.vcf and germline.Aim1.vcf used in the commands above, respectively.

    Somatic

    1 891292 . T C . base_qual;haplotype;normal_artifact;panel_of_normals;weak_evidence AS_FilterStatus=weak_evidence,base_qual;AS_SB_TABLE=161,5|14,0;DP=189;ECNT=2;FUNCOTATION=[NOC2L|hg19|chr1|891292|891292|INTRON||SNP|T|T|C|g.chr1:891292T>C|ENST00000327044.6|-|||c.e6-11A>G|||0.5960099750623441|CTTCACCCCCTCCCCTCTTAC|KLHL17_ENST00000338591.3_FIVE_PRIME_FLANK/NOC2L_ENST00000487214.1_INTRON|||||||||||||||||||||||||90|biliary_tract(2)_%7C_breast(11)_%7C_central_nervous_system(44)_%7C_large_intestine(11)_%7C_pancreas(22)|||||||BC003555|NM_015658.3|NP_056473.2|HGNC:24517|NOC2_%20_like_%20_nucleolar_%20_associated_%20_transcriptional_%20_repressor|Approved|gene_%20_with_%20_protein_%20_product|protein-coding_%20_gene||"nucleolar_%20_complex_%20_associated_%20_2_%20_homolog_%20_(S._%20_cerevisiae)"|DKFZP564C186_%2C__%20_NET7_%2C__%20_NET15_%2C__%20_NIR_%2C__%20_PPP1R112|"novel_%20_INHAT_%20_repressor"_%2C__%20_"protein_%20_phosphatase_%20_1_%2C__%20_regulatory_%20_subunit_%20_12"|1p36.33|2016-02-15||2016-02-15|AL050019||26155|ENSG00000188976|20123734_%2C__%20_22363708|NM_015658|694|Protein_%20_phosphatase_%20_1_%20_regulatory_%20_subunits|CCDS3|OTTHUMG00000040720|26155|610770|NM_015658|Q9Y3T9|ENSG00000188976|uc001abz.5|hg19|OREG1492698|Type_%3D_TRANSCRIPTION_%20_FACTOR_%20_BINDING_%20_SITE_%7C_Gene_Symbol_%3D_NOC2L_%7C_Gene_ID_%3D_ENSG00000188976_%7C_Gene_Source_%3D_ENSEMBL_%7C_Regulatory_Element_Symbol_%3D_EGR1_%7C_Regulatory_Element_ID_%3D_ENST00000239938_%7C_Regulatory_Element_Source_%3D_ENSEMBL_%7C_PMID_%3D_18971253_%7C_Dataset_%3D_PAZAR|NOC2L_HUMAN||Q5SVA3_%7C_Q9BTN6|Q9Y3T9|apoptotic_%20_process_%20_(GO:0006915)_%7C_cellular_%20_response_%20_to_%20_UV_%20_(GO:0034644)_%7C_chromatin_%20_assembly_%20_(GO:0031497)_%7C_negative_%20_regulation_%20_of_%20_B_%20_cell_%20_apoptotic_%20_process_%20_(GO:0002903)_%7C_negative_%20_regulation_%20_of_%20_histone_%20_acetylation_%20_(GO:0035067)_%7C_negative_%20_regulation_%20_of_%20_intrinsic_%20_apoptotic_%20_signaling_%20_pathway_%20_(GO:2001243)_%7C_negative_%20_regulation_%20_of_%20_transcription_%20_from_%20_RNA_%20_polymerase_%20_II_%20_promoter_%20_(GO:0000122)_%7C_nucleolus_%20_to_%20_nucleoplasm_%20_transport_%20_(GO:0032066)_%7C_transcription_%2C__%20_DNA-templated_%20_(GO:0006351)|nucleolus_%20_(GO:0005730)_%7C_nucleoplasm_%20_(GO:0005654)_%7C_nucleus_%20_(GO:0005634)|chromatin_%20_binding_%20_(GO:0003682)_%7C_histone_%20_binding_%20_(GO:0042393)_%7C_nucleosome_%20_binding_%20_(GO:0031491)_%7C_poly(A)_%20_RNA_%20_binding_%20_(GO:0044822)_%7C_repressing_%20_transcription_%20_factor_%20_binding_%20_(GO:0070491)_%7C_transcription_%20_corepressor_%20_activity_%20_(GO:0003714)|||||||||||||||||||false|false||false|false||false|false|false||false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|||false|false||false||false||false|false|false||false|||false|||];GERMQ=93;MBQ=34,11;MFRL=214,213;MMQ=60,60;MPOS=17;NALOD=-3.866;NLOD=18.17;PON;POPAF=3.91;TLOD=5.15 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:72,10:0.087:82:19,1:36,0:0|1:891292_T_C:891292:69,3,10,0 0|0:94,4:0.038:98:12,2:53,0:0|1:891292_T_C:891292:92,2,4,0

    Germline

    1 179924563 . C G 56.36 PASS AC=1;AF=0.031;AN=32;BaseQRankSum=-0.842;DP=58;ExcessHet=3.0103;FS=0;FUNCOTATION=[CEP350|hg19|chr1|179924563|179924563|INTRON||SNP|C|C|G|g.chr1:179924563C>G|ENST00000367607.3|+|||c.e1+286C>G|||0.5610972568578554|GGTAAGATACCTCTAGCCTTT|RP11-533E19.5_ENST00000567904.1_FIVE_PRIME_FLANK|||||||||||||||||||||||||31|breast(1)_%7C_central_nervous_system(22)_%7C_oesophagus(8)|||||||AL645487|NM_014810.4|NP_055625.4|HGNC:24238|centrosomal_%20_protein_%20_350|Approved|gene_%20_with_%20_protein_%20_product|protein-coding_%20_gene||"centrosomal_%20_protein_%20_350kDa"|KIAA0480_%2C__%20_CAP350|"centrosome_%20_associated_%20_protein_%20_350"|1q25.2|2016-03-30||2016-03-30|AF287356||9857|ENSG00000135837|16314388_%2C__%20_15615782|NM_014810|||CCDS1336|OTTHUMG00000035269|9857||NM_014810|Q5VT06|ENSG00000135837|uc001gnt.4|hg19|OREG1268027|Type_%3D_TRANSCRIPTION_%20_FACTOR_%20_BINDING_%20_SITE_%7C_Gene_Symbol_%3D_CEP350_%7C_Gene_ID_%3D_ENSG00000135837_%7C_Gene_Source_%3D_ENSEMBL_%7C_Regulatory_Element_Symbol_%3D_SMARCA4_%7C_Regulatory_Element_ID_%3D_ENST00000344626_%7C_Regulatory_Element_Source_%3D_ENSEMBL_%7C_PMID_%3D_18971253_%7C_Dataset_%3D_PAZAR|CE350_HUMAN||O75068_%7C_Q8TDK3_%7C_Q8WY20|Q5VT06|microtubule_%20_anchoring_%20_(GO:0034453)|centrosome_%20_(GO:0005813)_%7C_cytoplasm_%20_(GO:0005737)_%7C_membrane_%20_(GO:0016020)_%7C_nucleus_%20_(GO:0005634)||||||||||||||||||||true|false|0.9984_%2C_0.001597|false|false|1|false|false|false|CEP350:9857|false|false|true|true|true|false|false|false|false|false|false|false|false|false|false|false|false|false|false|false|143883644|179924563|false|false|0|false|0|false|0.99684_%2C_0.00315956|false|false|false|SNV|true|0x050000080005040036000100|1|false|134|rs143883644|];InbreedingCoeff=-0.1083;MLEAC=1;MLEAF=0.031;MQ=60;MQRankSum=0;QD=9.39;ReadPosRankSum=0.842;SOR=0.307;VQSLOD=19.42;culprit=MQ GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,82 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,98 0/0:6,0:6:18:0,18,229 0/0:2,0:2:6:0,6,43 0/0:5,0:5:12:0,12,180 0/0:1,0:1:3:0,3,40 0/0:2,0:2:6:0,6,61 0/0:6,0:6:15:0,15,225 0/0:9,0:9:27:0,27,338 0/0:2,0:2:6:0,6,60 0/0:1,0:1:3:0,3,20 0/0:2,0:2:6:0,6,66 0/0:2,0:2:6:0,6,64 0/0:3,0:3:9:0,9,102 0/1:4,2:6:68:68,0,150

    It looks like the somatic variant is lacking an AC (allele count) annotation, but it still finishes properly. Germline has AC=1.

    DP (Allele Depth) is 189 for somatic, 58 for germline. There are also DP values for the matched samples in the somatic and cohort in the germline.

    AF (allele frequency) is 0.031 in germline and the somatic has PopAF and AF for the samples only.

    If anything, I would expect the somatic variant to be the one having trouble. I just ran both commands in interactive mode only changing the input VCF and still the germline is the only one that throws an error.

    I've even tried a simpler JEXL expression on a less filtered version of the germline file (command below) to see if it was too filtered or the expression was too complex.

    gatk SelectVariants -R /mnt/data/rbueno/ref_genomes/human/BWA-MEM/b37_decoys/hs37d5.fa -V annotated.b37names.germline.vcf -select "AC >= 2" --select-type-to-include SNP -O test.selectgermlinevariants.vcf

    The above command still yielded the "Invalid JEXL expression detected for select-0". Is it possible to filter germline cohort VCFs with SelectVariants?

    Thanks again.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Mark Godek, is there a reason why you are not using VariantFiltration for the filtering step? It seems like these JEXL expressions would be better in that tool, and you could use SelectVariants after for getting just the SNPs.

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    Hi Genevieve Brandt (she/her),

    I'm using SelectVariants because it was recommended under heading 6 of the Variant Call Format page of the GATK website.

    Actually, rereading it, it does say it works on cohorts, so I still don't know why I'm getting a JEXL expression error with SelectVariants but I will give VariantFiltration a shot and report back.

    Thanks.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Mark Godek, yes there are a lot of different tools in the GATK! I think VariantFiltration will work better for using multiple different filters, and it has notes in the documentation on how to properly submit the command with the filters you are using.

    Let me know if the error pops up there too.

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    Hi Genevieve Brandt (she/her),

    I tried VariantFiltration and it works, however it throws a warning as it is running. I'm not sure if this is a big deal because it does create the output VCF and adds the filter name to the FILTER column when my AC is over 1. 

    Command:

    gatk VariantFiltration -R /mnt/data/rbueno/ref_genomes/human/BWA-MEM/b37_decoys/hs37d5.fa -V /data/rbueno/analysis_files/MedGenome_FamilialMPMs/old_pipeline/analysis/germline/novel.germline.annotated.reselect.vcf -O /data/rbueno/analysis_files/MedGenome_FamilialMPMs/old_pipeline/analysis/germline/novel.germline.annotated.reselect.quality_reads.vcf --filter-name "Allele_Depth" --filter-expression "AC > 1"

    Warning:

    09:01:04.779 WARN  JexlEngine - ![0,6]: 'AC > 1;' > error

    Output:

    #CHROM POS ID REF ALT QUAL FILTER INFO

    1 13065 . G A 33.58 VQSRTrancheSNP99.90to100.00 AC=1

    1 13302 . C T 124.39 Allele_Depth;VQSRTrancheSNP99.90to100.00 AC=2

    1 13418 . G A 42.98 VQSRTrancheSNP99.90to100.00 AC=1

    1 13621 . A G 44.56 Allele_Depth;VQSRTrancheSNP99.90to100.00 AC=2

     

    Is there a way to exclude filtered variants from the output?

    The documentation says "Filtered records will be preserved in the output unless their removal is requested in the command line." but VariantFiltration lacks the --exclude-filtered option that SelectVariants has.

     

    Thanks.

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    I found a post on Biostars on excluding based on SelectVariants and it seems to have worked when I tried it.

    gatk SelectVariants -R /mnt/data/rbueno/ref_genomes/human/BWA-MEM/b37_decoys/hs37d5.fa --variant /data/rbueno/analysis_files/MedGenome_FamilialMPMs/old_pipeline/analysis/germline/novel.germline.annotated.reselect.invalidate_old_filters.vcf -O /data/rbueno/analysis_files/MedGenome_FamilialMPMs/old_pipeline/analysis/germline/filtered.vcf -select 'FILTER != "Allele_Depth"' --exclude-filtered true

    So SelectVariants will filter my cohort file, but not using JEXL expressions.

    Still, if there is a way to do this with VariantFiltration I'd be interested how to do it with a single command.

    Thanks again.

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Mark Godek Thanks for the update about how you are filtering. Yes, usually VariantFiltration and SelectVariants are used together.

    However, if another user has found a workaround they can chime into the discussion!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk