Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

SelectVariants v4.1.6.0 doesn't select the variants as expected

Answered
0

8 comments

  • Avatar
    Bhanu Gandham

    Hi ABours

     

    I am looking into this and will get back to you soon.

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    HI ABours

     

    Can you please share a few vcf variant records for the type 5(spanning deletion + snp). Also please share the exact command you used to generate that vcf i.e. SelectVariants with select-type-to-include SNP. This will help us troubleshoot this issue. 

    0
    Comment actions Permalink
  • Avatar
    ABours

    Hi Bhanu,

    Thank you for looking into this.

    The command I used was (gatk4 her is v4.1.6.0):

    java -Xms24G -Xmx24G -jar ${gatk4} SelectVariants -R ${ref} -V final_v4.1.6.0.vcf.gz -select-type SNP -O final_snp_v4.1.6.0.vcf.gz

    and these are a few of those type 5  sites from the subsetted vcf:

    chr_1 3632 . A C,* 55.53 . AC=1,7;AF=0.050,0.350;AN=20;BaseQRankSum=0.060;DP=233;ExcessHet=7.9825;FS=2.137;InbreedingCoeff=-0.0586;MLEAC=2,12;MLEAF=0.100,0.600;MQ=51.08;MQRankSum=-1.981e+00;QD=0.87;ReadPosRankSum=-1.465e+00;SOR=0.809 GT:AD:DP:GQ:PGT:PID:PL:PS 2|2:0,0,3:3:9:1|1:3628_CGGGATGGGACAGAGATCCCT_C:135,135,135,9,9,0:3628 ./.:11,0,0:11:.:.:.:0,0,0,0,0,0 ./.:20,0,0:20:.:.:.:0,0,0,0,0,0 0|1:5,2,0:7:55:0|1:3632_A_C:55,0,204,70,210,280:3632 0|2:5,0,9:14:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:348,363,573,0,210,182:3628 ./.:22,0,0:22:.:.:.:0,0,0,0,0,0 ./.:12,0,0:12:.:.:.:0,0,0,0,0,0 0/0:17,0,0:17:0:.:.:0,0,155,0,155,155 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 0/0:5,0,0:5:0:.:.:0,0,117,0,117,117 ./.:16,0,0:16:.:.:.:0,0,0,0,0,0 0|2:2,0,7:9:62:0|1:3628_CGGGATGGGACAGAGATCCCT_C:289,295,379,0,84,62:3628 ./.:13,0,0:13:.:.:.:0,0,0,0,0,0 0|2:3,0,3:6:95:0|1:3629_GGGATGGGA_G:95,104,230,0,126,117:3629 ./.:28,0,0:28:.:.:.:0,0,0,0,0,0 0/0:6,0,0:6:15:.:.:0,15,225,15,225,225 0|2:8,0,5:13:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:188,212,511,0,300,283:3628 0|2:3,0,9:12:80:0|1:3628_CGGGATGGGACAGAGATCCCT_C:370,379,487,0,108,80:3628 ./.:11,0,0:11:.:.:.:0,0,0,0,0,0
    chr_1 3633 . T *,C,A 1545.11 . AC=7,16,1;AF=0.250,0.571,0.036;AN=28;BaseQRankSum=1.15;DP=164;ExcessHet=4.0770;FS=0.000;InbreedingCoeff=0.1758;MLEAC=9,19,1;MLEAF=0.321,0.679,0.036;MQ=36.05;MQRankSum=0.00;QD=16.79;ReadPosRankSum=1.15;SOR=0.774 GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,3,0,0:3:9:1|1:3628_CGGGATGGGACAGAGATCCCT_C:135,9,0,135,9,135,135,9,135,135:3628 ./.:11,0,0,0:11:.:.:.:0,0,0,0,0,0,0,0,0,0 2|2:0,0,2,0:2:6:1|1:3633_T_C:90,90,90,6,6,0,90,90,6,90:3633 0|3:6,0,0,2:8:52:0|1:3632_A_C:52,70,303,70,303,303,0,233,233,227:3632 1/2:0,9,5,0:14:99:.:.:553,184,183,365,0,347,554,199,364,563 2|2:0,0,3,0:3:9:1|1:3633_T_C:119,119,119,9,9,0,119,119,9,119:3633 2|2:0,0,1,0:1:3:1|1:3633_T_C:42,42,42,3,3,0,42,42,3,42:3633 2|2:0,0,8,0:8:24:1|1:3633_T_C:352,352,352,24,24,0,352,352,24,352:3633 ./.:0,0,0,0:0:.:.:.:0,0,0,0,0,0,0,0,0,0 0|2:1,0,3,0:4:20:0|1:3633_T_C:108,111,139,0,29,20,111,139,29,139:3633 2/2:0,0,4,0:4:13:.:.:181,181,181,13,13,0,181,181,13,181 0|1:2,7,0,0:9:62:0|1:3628_CGGGATGGGACAGAGATCCCT_C:289,0,62,295,84,379,295,84,379,379:3628 ./.:13,0,0,0:13:.:.:.:0,0,0,0,0,0,0,0,0,0 1/2:0,3,3,0:6:96:.:.:196,99,116,104,0,96,202,114,105,214 ./.:28,0,0,0:28:.:.:.:0,0,0,0,0,0,0,0,0,0 2/2:0,0,6,0:6:18:.:.:245,245,245,18,18,0,245,245,18,245 0|1:8,5,0,0:13:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:188,0,283,212,300,511,212,300,511,511:3628 1/2:1,9,2,0:12:56:.:.:434,59,56,353,0,373,437,84,378,462 ./.:11,0,0,0:11:.:.:.:0,0,0,0,0,0,0,0,0,0
    chr_1 3736 . G A,* 232.35 . AC=2,1;AF=0.053,0.026;AN=38;BaseQRankSum=1.05;DP=229;ExcessHet=3.3995;FS=2.781;InbreedingCoeff=-0.1209;MLEAC=2,1;MLEAF=0.053,0.026;MQ=34.01;MQRankSum=1.56;QD=5.16;ReadPosRankSum=1.25;SOR=1.312 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:3,0,0:3:9:.:.:0,9,74,9,74,74 0/0:10,0,0:10:30:.:.:0,30,327,30,327,327 0/0:6,0,0:6:18:.:.:0,18,241,18,241,241 0/0:25,0,0:25:66:.:.:0,66,990,66,990,990 0|1:7,3,2:12:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:105,0,270,126,279,405:3628 0/0:5,0,0:5:15:.:.:0,15,140,15,140,140 0/0:5,0,0:5:15:.:.:0,15,150,15,150,150 0|2:15,1,6:22:99:0|1:3633_T_C:203,248,841,0,592,574:3633 0/0:9,0,0:9:27:.:.:0,27,310,27,310,310 0/0:2,0,0:2:6:.:.:0,6,37,6,37,37 0/0:15,0,0:15:42:.:.:0,42,630,42,630,630 0|1:7,4,0:11:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:147,0,267,168,279,447:3628 0/0:3,0,0:3:9:.:.:0,9,92,9,92,92 0/0:9,0,0:9:27:.:.:0,27,319,27,319,319 0/0:39,0,0:39:80:.:.:0,80,1355,80,1355,1355 0/0:10,0,0:10:0:.:.:0,0,151,0,151,151 0/0:17,0,0:17:10:.:.:0,10,534,10,534,534 0/0:9,0,0:9:24:.:.:0,24,360,24,360,360 0/0:16,0,0:16:29:.:.:0,29,536,29,536,536

    if you want to have the numbers, on the output file of my command I call ~30.3 million SNPs, and of this 2,5% are type 5 sites.

    I hope this helps,

     

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi ABours

     

    Looking into the code it seems like the tool is doing what its supposed to. To clarify, when we say MIXED we mean SNP/Indel + Symbolic. Symbolic variation is represented like this <*>.

    SelectVariants actually ignores spanning deletion, which is represented by *.

     

    I agree this is confusing, but I hope this explanation helps,

    0
    Comment actions Permalink
  • Avatar
    ABours

    Hi Bhanu,

    Ok, it's nice to know that the tool does what the code says. also the confusion remains, but I can work with it.

    I would say I'm just questioning whether the code is logical for what one expects. Of course, I realize that I'll have to work with what I've got (at least at the moment). I mainly just wanted to provide some feedback/suggestion to one of your tools in your very nice toolset.

    Thanks for check this and best,

     

    0
    Comment actions Permalink
  • Avatar
    Matt Johnson

    Hello, I am also having this issue. This page indicates the following definition of SYMBOLIC:

    SYMBOLIC (such as the <NON-REF> allele used in GVCFs produced by HaplotypeCaller, the * allele used to signify the presence of a spanning deletion, or undefined events like a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)

    Therefore I would expect SelectVariants --select-type-to-exclude SYMBOLIC to not have any calls containing spanning deletions. However, the output VCFs still do (in gatk4 4.2.0.0).

    This causes problems for downstream tools like FastaAlternateReferenceMaker:

    java.lang.IllegalArgumentException: the input sequence contains invalid base calls like: *

    Is there any way to force GATK to exclude spanning deletions when filtering a VCF?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Matt Johnson You can use a JEXL expression with --SelectExpressions to filter out the spanning deletions. Read more about JEXL expressions here: https://gatk.broadinstitute.org/hc/en-us/articles/360035891011-JEXL-filtering-expressions

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Matt Johnson we tested this issue further and we were not able to get it to work with the JEXL expressions so I created an issue ticket so that our team can get this functionality working: https://github.com/broadinstitute/gatk/issues/7341

    Thanks for bringing it to our attention!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk