I'm splitting the VCF that I got from GenotypeGVCFs into separate SNP and INDEL VCFs (for hard-filtering) with SelectVariants from GATK version 126.96.36.199. However, the tool doesn't behave the way I expect based on the following documentation:
In the 1st doc you specifically note (under subheader "Variant manipulation") that when running --select-type-to-include SNP with SelectVariants the tool will only pull out "pure" SNPs and that any mixed sites will not be in there. I checked the behaviour of SelectVariants on the following types of sites and noted which select-type value pulls each one:
- pure SNP -> pulled by select-type SNP
- pure deletion -> pulled by select-type INDEL
- mixed -> pulled by select-type MIXED
- spanning deletion and insertion -> pulled by select-type MIXED
- spanning deletion and SNP -> pulled by select-type SNP
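The five cases above can be written down as a small classifier, which makes it easier to see where the two interpretations diverge. This is only a sketch under simplifying assumptions: `classify_site` is a hypothetical helper, not part of GATK, and it says nothing about how SelectVariants decides internally. Alleles are compared by length only, so MNPs are lumped in with SNPs, symbolic alleles are not handled, and "pure deletion" is generalized to any indel.

```python
SPANNING_DELETION = "*"  # VCF allele meaning an upstream deletion overlaps this site


def classify_site(ref, alts):
    """Place a site into one of the five categories from the list above.

    Simplified on purpose: same-length alleles count as SNPs,
    different-length alleles count as indels.
    """
    has_spanning = SPANNING_DELETION in alts
    real_alts = [a for a in alts if a != SPANNING_DELETION]
    has_snp = any(len(a) == len(ref) and a != ref for a in real_alts)
    has_indel = any(len(a) != len(ref) for a in real_alts)

    if has_snp and has_indel:
        return "mixed"                      # type 3
    if has_spanning and has_indel:
        return "spanning deletion + indel"  # type 4
    if has_spanning and has_snp:
        return "spanning deletion + SNP"    # type 5
    if has_indel:
        return "pure indel"                 # type 2
    if has_snp:
        return "pure SNP"                   # type 1
    return "other"


if __name__ == "__main__":
    # Made-up alleles, one per category.
    for ref, alts in [("A", ["G"]), ("AT", ["A"]), ("A", ["G", "AT"]),
                      ("A", ["*", "AT"]), ("A", ["*", "G"])]:
        print(ref, ",".join(alts), "->", classify_site(ref, alts))
```

The example alleles in the `__main__` block are made up for illustration; they are not taken from my data.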
So types 1 to 3 behave as expected, but the results for types 4 and 5 contradict each other. Let me explain:
Either a site with a spanning deletion is considered a MIXED site, in which case it is correct that type 4 is found by select-type MIXED (and no other). But then type 5 should also be pulled by select-type MIXED, which it is not (I checked).
Or a site with a spanning deletion is considered a "light variant" of a pure SNP or INDEL. In that case the behaviour of select-type SNP pulling out type 5 is correct, but type 4 should also be pulled out by select-type INDEL, which SelectVariants doesn't do. Of course, if this is the case, then you should reconsider whether a site with a spanning deletion should be pulled by select-type MIXED (as I saw with type 4).
Personally, I see a site with a spanning deletion and a SNP as a MIXED site, because the spanning deletion inherently implies an INDEL in that region. Thus, from my point of view and from how I read your docs, select-type SNP unnecessarily inflates my (subsetted) pure SNP VCF by adding type 5 sites. For people with small genomes and/or few samples (and plenty of storage) this is not a big problem, but as genomes get bigger and more samples are added, an unnecessarily big SNP VCF (one that also contains type 5 sites) just takes up storage and computing time. Maybe you could make select-type SNP pull only pure SNPs (as select-type INDEL does for INDELs) and add two more select-types to SelectVariants: one that specifically pulls out SNPs plus their "light variant" (type 5 in the list above), and one that specifically pulls out INDELs plus their "light variant" (type 4 in the list above).
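Until something like that exists, a possible workaround is to post-filter the SNP VCF and drop every record whose ALT column contains the spanning-deletion allele `*`. The sketch below assumes an uncompressed, tab-separated VCF read line by line. `drop_spanning_deletion_sites` is a hypothetical helper of my own, not a GATK feature, and it removes the whole site rather than trimming the `*` allele.

```python
def drop_spanning_deletion_sites(vcf_lines):
    """Yield header lines and records whose ALT column lacks the '*' allele."""
    for line in vcf_lines:
        # Pass header lines and blank lines through untouched.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")  # column 5 of a VCF record is ALT
        if "*" not in alts:
            yield line


if __name__ == "__main__":
    # Made-up records: the second one is a type 5 site and gets dropped.
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "1\t100\t.\tA\tG\n",
        "1\t101\t.\tA\tG,*\n",
    ]
    for kept in drop_spanning_deletion_sites(example):
        print(kept, end="")
```

This only shrinks the file after the fact, so it doesn't replace a proper select-type for pure SNPs.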
I hope that my post is clear and that you can adjust the behaviour of SelectVariants, and that you maybe consider my suggestions.