gatk SelectVariant identified no snps in the vsf file while it has 63702926 snps
My goal is to subset snps from the input file (AA01_rg.g.vcf). Before using the gatk SelectVariants for this purpose, I used the following command to count number of snps in my input file:
grep -v "#" /scratch/vn81649/snapbean/gvcf_new/AA01_rg.g.vcf | wc -l
This command identified 63702926 snps in the input file.
However, when I used gatk SelectVariants, the output file exhibited no SNPs.
REQUIRED for all errors and issues:
a) GATK version used:
The Genome Analysis Toolkit (GATK) v4.2.5.0
b) Exact command used:
ref="/scratch/vn81649/snapbean/Ref/ref.fa"
input="/scratch/vn81649/snapbean/gvcf_new/AA01_rg.g.vcf"
output="/scratch/vn81649/snapbean/snps/AA01_rg.snps.vcf"
time gatk SelectVariants -R $ref -V $input --select-type-to-include SNP -O $output
c) Entire program log:
13:48:37.648 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/apps/eb/GATK/4.2.5.0-GCCcore-8.3.0-Java-1.8/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 06, 2023 1:48:37 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:48:37.829 INFO SelectVariants - ------------------------------------------------------------
13:48:37.829 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.2.5.0
13:48:37.829 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
13:48:37.831 INFO SelectVariants - Executing as vn81649@c4-11.compute.lan on Linux v3.10.0-1160.36.2.el7.x86_64 amd64
13:48:37.831 INFO SelectVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_241-b07
13:48:37.831 INFO SelectVariants - Start Date/Time: February 6, 2023 1:48:37 PM EST
13:48:37.831 INFO SelectVariants - ------------------------------------------------------------
13:48:37.831 INFO SelectVariants - ------------------------------------------------------------
13:48:37.831 INFO SelectVariants - HTSJDK Version: 2.24.1
13:48:37.831 INFO SelectVariants - Picard Version: 2.25.4
13:48:37.831 INFO SelectVariants - Built for Spark Version: 2.4.5
13:48:37.832 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:48:37.832 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:48:37.832 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:48:37.832 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:48:37.832 INFO SelectVariants - Deflater: IntelDeflater
13:48:37.832 INFO SelectVariants - Inflater: IntelInflater
13:48:37.832 INFO SelectVariants - GCS max retries/reopens: 20
13:48:37.832 INFO SelectVariants - Requester pays: disabled
13:48:37.832 INFO SelectVariants - Initializing engine
13:48:38.515 INFO FeatureManager - Using codec VCFCodec to read file file:///scratch/vn81649/snapbean/gvcf_new/AA01_rg.g.vcf
13:48:39.755 INFO SelectVariants - Done initializing engine
13:48:39.779 INFO ProgressMeter - Starting traversal
13:48:39.779 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
13:50:40.186 INFO ProgressMeter - unmapped 2.0 0 0.0
13:50:40.186 INFO ProgressMeter - Traversal complete. Processed 0 total variants in 2.0 minutes.
13:50:40.188 INFO SelectVariants - Shutting down engine
[February 6, 2023 1:50:40 PM EST] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 2.04 minutes.
Runtime.totalMemory()=2010120192
Using GATK jar /apps/eb/GATK/4.2.5.0-GCCcore-8.3.0-Java-1.8/gatk-package-4.2.5.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /apps/eb/GATK/4.2.5.0-GCCcore-8.3.0-Java-1.8/gatk-package-4.2.5.0-local.jar SelectVariants -R /scratch/vn81649/snapbean/Ref/ref.fa -V /scratch/vn81649/snapbean/gvcf_new/AA01_rg.g.vcf --select-type-to-include SNP -O /scratch/vn81649/snapbean/snps/AA01_rg.snps.vcf
real 2m6.465s
user 2m5.670s
sys 0m6.761s
end
-
Hi Vishal,
Based on the file extensions, it looks like you are running on a GVCF. Currently, when SelectVariants is run on a GVCF to select snps, the output will always be empty, due to the inclusion of <NON_REF> alleles confusing SelectVariants (see https://github.com/broadinstitute/gatk/issues/7111).
There is a PR to fix this here, but it hasn't been merged yet: https://github.com/broadinstitute/gatk/pull/7193
If you need to do this with SelectVariants, you could either convert your gvcf to a vcf first, or try running on the branch for that pr (mg_gvcf_aware_varianttypesvariantfilter). Though given that it has not yet been merged, definitely "buyer beware". You could also try bcftools or vcftools, one of them might be able to do this subsetting for currently. -
Vishal Negi the PR I reference above was merged recently, so this should be fixed as of 4.4.0.0
Please sign in to leave a comment.
2 comments