SelectVariants discordance gives no discordance
This is my first time using GATK. The SelectVariants discordance function returns no unique alleles between two individuals for me that have clear differences when viewing them in Tassel GUI. Wondering what I'm doing wrong. Details are below.
I've generated a large vcf file containing >900 individuals with command line Tassel. Depth filtered with vcftools, missingness taxa and site filtered with Tassel GUI, MAF a HW filtered with vcftools. Reduced the vcf file to two individual files of taxa I'm interested in as larger ones were not working either. I've tried a number of things but the current working directory has
- the reference fasta that was used to generate the original vcf
- a reference fai index generated with samtools
- a reference dictionary generated with gatk's CreateSequenceDictionary
- the two individual vcf's
- a index for what would be the multi taxa vcf (other_ind.vcf) using gatk's IndexFeatureFile
The on-screen report is below. I've changed file names and directories as to protect information incase something might not jive in that area. The output file has what looks like normal header information but reports nothing where you would expect variants. Thanks for any help in advance.
[jordan.brungardt@ceres19-compute-62 trblsht2]$ gatk SelectVariants -R CillMX87.fa -V ind_OI.vcf --discordance other_ind.vcf -O out
Using GATK jar /project/software/el9/apps/gatk/4.4.0.0/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /project/software/el9/apps/gatk/4.4.0.0/gatk-package-4.4.0.0-local.jar SelectVariants -R CillMX87.fa -V ind_OI.vcf --discordance other_ind.vcf -O out
08:49:50.677 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/project/software/el9/apps/gatk/4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
08:49:50.778 INFO SelectVariants - ------------------------------------------------------------
08:49:50.781 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.4.0.0
08:49:50.782 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
08:49:50.782 INFO SelectVariants - Executing as jordan.brungardt@ceres19-compute-62.scinet.local on Linux v5.14.0-362.24.2.el9_3.x86_64 amd64
08:49:50.782 INFO SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v17.0.11+9
08:49:50.782 INFO SelectVariants - Start Date/Time: May 15, 2024 at 8:49:50 AM CDT
08:49:50.783 INFO SelectVariants - ------------------------------------------------------------
08:49:50.783 INFO SelectVariants - ------------------------------------------------------------
08:49:50.788 INFO SelectVariants - HTSJDK Version: 3.0.5
08:49:50.788 INFO SelectVariants - Picard Version: 3.0.0
08:49:50.788 INFO SelectVariants - Built for Spark Version: 3.3.1
08:49:50.789 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
08:49:50.789 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:49:50.790 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:49:50.790 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:49:50.790 INFO SelectVariants - Deflater: IntelDeflater
08:49:50.790 INFO SelectVariants - Inflater: IntelInflater
08:49:50.791 INFO SelectVariants - GCS max retries/reopens: 20
08:49:50.791 INFO SelectVariants - Requester pays: disabled
08:49:50.792 INFO SelectVariants - Initializing engine
08:49:51.116 INFO FeatureManager - Using codec VCFCodec to read file file:/other_ind.vcf
08:49:51.186 INFO FeatureManager - Using codec VCFCodec to read file file:/ind_OI.vcf
08:49:51.202 WARN IndexUtils - Feature file "file:/other_ind.vcf" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
08:49:51.213 INFO SelectVariants - Done initializing engine
08:49:51.234 INFO SelectVariants - Selecting only variants discordant with the track: /other_ind.vcf
08:49:51.245 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "". Defaulting to VCF.
08:49:51.274 INFO ProgressMeter - Starting traversal
08:49:51.275 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
08:49:51.736 INFO ProgressMeter - CM029010.1:5123612 0.0 7738 1031733.3
08:49:51.737 INFO ProgressMeter - Traversal complete. Processed 7738 total variants in 0.0 minutes.
08:49:51.919 INFO SelectVariants - Shutting down engine
[May 15, 2024 at 8:49:51 AM CDT] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=285212672
-
Have you removed HOMREF calls from both sets before you used SelectVariants with discordance option?
If both sets contain same sites then it may not be possible to pick discordant calls from one set when compared to the other.
I hope this helps.
-
Wishful thinking strikes again. It sounded like this was for selecting alleles unique to an individual. If discordance is filtering for sites unique sites/loci than it should be working fine. Is there a GATK function for what I'm looking for? It sounds like there's a third party script for vcftools that selects unique alleles so I might go that route.
-
Instead of a single step work for this job you may be able to split this whole work into multiple steps.
1- Annotate your VCF files with VariantAnnotator and include SampleList annotation therefore you may be able to see which samples contain the alt allele.
2- Convert your VCF files to table and make sure that SampleList annotation is present in the columns.
3- You may be able to filter your variants using any tool you want such as R or python or bash etc. to get sites that are unique for each taxa.
I hope this helps.
-
Thanks for the reply Gökalp,
The vcf-tools vcf-contrast function works nicely for one-to-many individuals analysis when reducing the vcf file to only individuals used in the query. It seems to misbehave when doing several-to-many comparisons as far as I can tell. It still cuts down on a lot of work. Seems like this should be a very relevant research question. Maybe a function that can be added to GATK in the future.
Please sign in to leave a comment.
4 comments