SelectVariants v4.1.6.0 doesn't select the variants as expected
AnsweredHi,
I'm subsetting the vcf that I got from GenotypeGVCFs into separate SNP and INDEL vcfs (to hard-filter), with SelectVariants from GATK version 4.1.6.0. However, the program doesn't behave how I expect it to, based on the following documentation:
https://gatk.broadinstitute.org/hc/en-us/articles/360035530752-What-types-of-variants-can-GATK-tools-detect-or-handle
https://gatk.broadinstitute.org/hc/en-us/articles/360041849831-SelectVariants#--select-type-to-include
In the 1st doc you specifically note (under subheader "Variant manipulation") that when running --select-type-to-include SNP with SelectVariants the tool will only pull out "pure" SNPs and that any mixed sites will not be in there. I check the behaviour of SelectVariants on the following types of sites and whether it's pulled by select-type <blank>:
- pure SNP -> pulled by select-type SNP
- pure deletion -> pulled by select-type INDEL
- mixed -> pulled by select-type MIXED
- spanning deletion and insertion -> pulled by select-type MIXED
- spanning deletion and snp -> pulled by select-type SNP
So for types 1 to 3 everything is alright. But type 4 and 5 are in contradiction to each other. Let me explain:
Either a site with a spanning deletion is considered a MIXED site and it's correct that type 4 is found by using select-type MIXED (and no other), however that would mean that also type 5 should be pulled by select-type MIXED (which it does not, I checked).
Or a site with a spanning deletion is considered a "light-variant" of a pure SNP or INDEL. This means that the behaviour seen by select-type SNP pulling out type 5 is correct, but type 4 should also be pulled out by doing select-type INDEL (which SelectVariants doesn't). Of course, if this is the case, than you should reconsider whether a spanning deletion will be pulled by select-type MIXED (as I saw with type 4).
Personally, I see a site with a spanning deletion and a snp as a MIXED site (because the spanning deletion inherently implies an INDEL in that region). Thus, from my point of view and how I read your docs the -select-type SNP unnecessarily inflates my (subsetted) pure SNP vcf by adding type 5 sites. For people with small genomes and/or low amount of samples (and high computer storage) this is not a big problem, but when the genomes become bigger and more samples are added, having a unnecessarily big SNP vcf (containing also type 5 sites) just takes up storage and computing time. Maybe you could make select-type SNP only pull pure SNPs (as select-type INDEL does) and add another function/select-type to SelectVariants that will specifically pull out SNPs and their "light-variant" a.k.a. type 5 from list above and one that specifically pulls out INDELs and their "light-variant" a.k.a. type 4 from list above.
I hope that my post is clear and that you can adjust the behaviour of SelectVariants, and that you maybe consider my suggestions.
Best,
-
Hi ABours
I am looking into this and will get back to you soon.
-
HI ABours
Can you please share a few vcf variant records for the type 5(spanning deletion + snp). Also please share the exact command you used to generate that vcf i.e. SelectVariants with select-type-to-include SNP. This will help us troubleshoot this issue.
-
Hi Bhanu,
Thank you for looking into this.
The command I used was (gatk4 her is v4.1.6.0):
java -Xms24G -Xmx24G -jar ${gatk4} SelectVariants -R ${ref} -V final_v4.1.6.0.vcf.gz -select-type SNP -O final_snp_v4.1.6.0.vcf.gz
and these are a few of those type 5 sites from the subsetted vcf:
chr_1 3632 . A C,* 55.53 . AC=1,7;AF=0.050,0.350;AN=20;BaseQRankSum=0.060;DP=233;ExcessHet=7.9825;FS=2.137;InbreedingCoeff=-0.0586;MLEAC=2,12;MLEAF=0.100,0.600;MQ=51.08;MQRankSum=-1.981e+00;QD=0.87;ReadPosRankSum=-1.465e+00;SOR=0.809 GT:AD:DP:GQ:PGT:PID:PL:PS 2|2:0,0,3:3:9:1|1:3628_CGGGATGGGACAGAGATCCCT_C:135,135,135,9,9,0:3628 ./.:11,0,0:11:.:.:.:0,0,0,0,0,0 ./.:20,0,0:20:.:.:.:0,0,0,0,0,0 0|1:5,2,0:7:55:0|1:3632_A_C:55,0,204,70,210,280:3632 0|2:5,0,9:14:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:348,363,573,0,210,182:3628 ./.:22,0,0:22:.:.:.:0,0,0,0,0,0 ./.:12,0,0:12:.:.:.:0,0,0,0,0,0 0/0:17,0,0:17:0:.:.:0,0,155,0,155,155 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 0/0:5,0,0:5:0:.:.:0,0,117,0,117,117 ./.:16,0,0:16:.:.:.:0,0,0,0,0,0 0|2:2,0,7:9:62:0|1:3628_CGGGATGGGACAGAGATCCCT_C:289,295,379,0,84,62:3628 ./.:13,0,0:13:.:.:.:0,0,0,0,0,0 0|2:3,0,3:6:95:0|1:3629_GGGATGGGA_G:95,104,230,0,126,117:3629 ./.:28,0,0:28:.:.:.:0,0,0,0,0,0 0/0:6,0,0:6:15:.:.:0,15,225,15,225,225 0|2:8,0,5:13:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:188,212,511,0,300,283:3628 0|2:3,0,9:12:80:0|1:3628_CGGGATGGGACAGAGATCCCT_C:370,379,487,0,108,80:3628 ./.:11,0,0:11:.:.:.:0,0,0,0,0,0
chr_1 3633 . T *,C,A 1545.11 . AC=7,16,1;AF=0.250,0.571,0.036;AN=28;BaseQRankSum=1.15;DP=164;ExcessHet=4.0770;FS=0.000;InbreedingCoeff=0.1758;MLEAC=9,19,1;MLEAF=0.321,0.679,0.036;MQ=36.05;MQRankSum=0.00;QD=16.79;ReadPosRankSum=1.15;SOR=0.774 GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,3,0,0:3:9:1|1:3628_CGGGATGGGACAGAGATCCCT_C:135,9,0,135,9,135,135,9,135,135:3628 ./.:11,0,0,0:11:.:.:.:0,0,0,0,0,0,0,0,0,0 2|2:0,0,2,0:2:6:1|1:3633_T_C:90,90,90,6,6,0,90,90,6,90:3633 0|3:6,0,0,2:8:52:0|1:3632_A_C:52,70,303,70,303,303,0,233,233,227:3632 1/2:0,9,5,0:14:99:.:.:553,184,183,365,0,347,554,199,364,563 2|2:0,0,3,0:3:9:1|1:3633_T_C:119,119,119,9,9,0,119,119,9,119:3633 2|2:0,0,1,0:1:3:1|1:3633_T_C:42,42,42,3,3,0,42,42,3,42:3633 2|2:0,0,8,0:8:24:1|1:3633_T_C:352,352,352,24,24,0,352,352,24,352:3633 ./.:0,0,0,0:0:.:.:.:0,0,0,0,0,0,0,0,0,0 0|2:1,0,3,0:4:20:0|1:3633_T_C:108,111,139,0,29,20,111,139,29,139:3633 2/2:0,0,4,0:4:13:.:.:181,181,181,13,13,0,181,181,13,181 0|1:2,7,0,0:9:62:0|1:3628_CGGGATGGGACAGAGATCCCT_C:289,0,62,295,84,379,295,84,379,379:3628 ./.:13,0,0,0:13:.:.:.:0,0,0,0,0,0,0,0,0,0 1/2:0,3,3,0:6:96:.:.:196,99,116,104,0,96,202,114,105,214 ./.:28,0,0,0:28:.:.:.:0,0,0,0,0,0,0,0,0,0 2/2:0,0,6,0:6:18:.:.:245,245,245,18,18,0,245,245,18,245 0|1:8,5,0,0:13:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:188,0,283,212,300,511,212,300,511,511:3628 1/2:1,9,2,0:12:56:.:.:434,59,56,353,0,373,437,84,378,462 ./.:11,0,0,0:11:.:.:.:0,0,0,0,0,0,0,0,0,0
chr_1 3736 . G A,* 232.35 . AC=2,1;AF=0.053,0.026;AN=38;BaseQRankSum=1.05;DP=229;ExcessHet=3.3995;FS=2.781;InbreedingCoeff=-0.1209;MLEAC=2,1;MLEAF=0.053,0.026;MQ=34.01;MQRankSum=1.56;QD=5.16;ReadPosRankSum=1.25;SOR=1.312 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:3,0,0:3:9:.:.:0,9,74,9,74,74 0/0:10,0,0:10:30:.:.:0,30,327,30,327,327 0/0:6,0,0:6:18:.:.:0,18,241,18,241,241 0/0:25,0,0:25:66:.:.:0,66,990,66,990,990 0|1:7,3,2:12:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:105,0,270,126,279,405:3628 0/0:5,0,0:5:15:.:.:0,15,140,15,140,140 0/0:5,0,0:5:15:.:.:0,15,150,15,150,150 0|2:15,1,6:22:99:0|1:3633_T_C:203,248,841,0,592,574:3633 0/0:9,0,0:9:27:.:.:0,27,310,27,310,310 0/0:2,0,0:2:6:.:.:0,6,37,6,37,37 0/0:15,0,0:15:42:.:.:0,42,630,42,630,630 0|1:7,4,0:11:99:0|1:3628_CGGGATGGGACAGAGATCCCT_C:147,0,267,168,279,447:3628 0/0:3,0,0:3:9:.:.:0,9,92,9,92,92 0/0:9,0,0:9:27:.:.:0,27,319,27,319,319 0/0:39,0,0:39:80:.:.:0,80,1355,80,1355,1355 0/0:10,0,0:10:0:.:.:0,0,151,0,151,151 0/0:17,0,0:17:10:.:.:0,10,534,10,534,534 0/0:9,0,0:9:24:.:.:0,24,360,24,360,360 0/0:16,0,0:16:29:.:.:0,29,536,29,536,536
if you want to have the numbers, on the output file of my command I call ~30.3 million SNPs, and of this 2,5% are type 5 sites.
I hope this helps,
-
Hi ABours
Looking into the code it seems like the tool is doing what its supposed to. To clarify, when we say MIXED we mean SNP/Indel + Symbolic. Symbolic variation is represented like this <*>.
SelectVariants actually ignores spanning deletion, which is represented by *.
I agree this is confusing, but I hope this explanation helps,
-
Hi Bhanu,
Ok, it's nice to know that the tool does what the code says. also the confusion remains, but I can work with it.
I would say I'm just questioning whether the code is logical for what one expects. Of course, I realize that I'll have to work with what I've got (at least at the moment). I mainly just wanted to provide some feedback/suggestion to one of your tools in your very nice toolset.
Thanks for check this and best,
-
Hello, I am also having this issue. This page indicates the following definition of SYMBOLIC:
SYMBOLIC (such as the
<NON-REF>
allele used in GVCFs produced by HaplotypeCaller, the*
allele used to signify the presence of a spanning deletion, or undefined events like a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)Therefore I would expect SelectVariants --select-type-to-exclude SYMBOLIC to not have any calls containing spanning deletions. However, the output VCFs still do (in gatk4 4.2.0.0).
This causes problems for downstream tools like FastaAlternateReferenceMaker:
java.lang.IllegalArgumentException: the input sequence contains invalid base calls like: *
Is there any way to force GATK to exclude spanning deletions when filtering a VCF? -
Matt Johnson You can use a JEXL expression with --SelectExpressions to filter out the spanning deletions. Read more about JEXL expressions here: https://gatk.broadinstitute.org/hc/en-us/articles/360035891011-JEXL-filtering-expressions
-
Matt Johnson we tested this issue further and we were not able to get it to work with the JEXL expressions so I created an issue ticket so that our team can get this functionality working: https://github.com/broadinstitute/gatk/issues/7341
Thanks for bringing it to our attention!
Please sign in to leave a comment.
8 comments