SelectVariants --discordance not working as I expected
Hi!
a) I'm using GATK4-4.2.0.0-1
b) gatk SelectVariants -V $combined.vcf.gz -R $genome --discordance $wildtype.vcf -O $discordant.combined.vcf.gz
c) Why do I see (......)? But I don't retrieve any variants (zero).
I'll explain from the beginning. I have 8 BAM datasets: 7 alleged mutants and 1 from the wild-type parental strain of the other 7. I ran HaplotypeCaller on each BAM file with the arguments -ERC GVCF -ploidy 1, because the strains are haploid and I don't expect differences in ploidy. Then I combined all the vcf files with CombineGVCFs, genotyped them with GenotypeGVCFs, and filtered the SNPs using these criteria:
-filter "QD < 20.0" --filter-name "QD20" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
Now, what I want to do is to remove all the variants that are present in the wild-type vcf track; or take the variants that are absent in the wild-type, same difference. And for that, I thought about using
gatk SelectVariants -V $combined.vcf.gz -R $genome --discordance $wildtype.vcf -O $discordant.combined.vcf.gz
Where $combined.vcf.gz is the combined file I got after combining, genotyping, and filtering, $genome is my reference genome, and $wildtype.vcf is the initial vcf file I produced with HaplotypeCaller for the wild-type bam dataset.
The thing is I get 0 variants back, and I can see there are discordant variants (variants that are present in one or more of the mutants, but not in the wildtype) using IGV and looking at the combined vcf.
I also tried running something similar with the individual files generated by HaplotypeCaller, in pairwise comparisons against the wild-type track, and I also get 0 variants back, so I must definitely be doing something wrong.
By the way, if I use --concordance instead, I get ALL the variants, even though some are clearly not concordant.
Thank you for your help,
Carlos
-
Hi Carlos P Arques,
I brought up this issue with some members of the GATK team this morning, and we think we should file a GitHub ticket to look into the --discordance argument and potential reasons why it is not giving you what you want. In the meantime, there are a few suggestions that you can try:
1. If the wild-type file you are using is actually a gvcf file, SelectVariants would be likely to fail. Could you confirm whether the file is a gvcf or vcf?
2. The argument may be causing SelectVariants to look at discordant sites rather than discordant variants at the genotype level. Could you try still specifying --discordance for the wild-type file but also specifying --sample-name for each individual sample name including the wild-type?
3. The last thing that was suggested was that you could potentially achieve what you want by specifying SelectVariants -V combined.vcf -XL wildtype.vcf to exclude the wild-type variants without using --discordance.
Please let me know if any of these suggestions reveal a different output and I will keep you updated on anything the GATK team is able to figure out.
Kind regards,
Pamela
-
Hi Pamela Bretscher,
Thank you for looking into it.
1. The wild-type was a VCF file, but when I examined it closely I found that a lot of positions had GT = 0, i.e. the same as the reference. So possibly all the files I was comparing had the same positions recorded as variant sites, even though in some of them the genotype was GT = 1 (a real variant) and in others GT = 0. But because the position was present in the VCF file either way, those sites weren't counted as discordant.
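A quick way to check for this is to count how many records in a single-sample VCF carry a reference genotype. The following is only a minimal sketch on a made-up toy VCF; it assumes haploid calls, the sample in column 10, and GT as the first FORMAT field. The sample name and all records are hypothetical.

```shell
#!/bin/sh
# Toy single-sample, haploid VCF (hypothetical data, not from this thread).
toy_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\twildtype\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0\n'
  printf 'chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t1\n'
  printf 'chr1\t300\t.\tG\tA\t70\tPASS\t.\tGT\t0\n'
} > "$toy_vcf"

# Count all data records (lines that are not headers).
total=$(awk '!/^#/ { n++ } END { print n + 0 }' "$toy_vcf")

# Count records whose genotype (first ":"-separated field of column 10) is 0,
# i.e. the site is listed in the file but the sample matches the reference.
ref_only=$(awk -F'\t' '!/^#/ { split($10, f, ":"); if (f[1] == "0") n++ }
                       END { print n + 0 }' "$toy_vcf")

echo "records: $total, records with GT = 0: $ref_only"
rm -f "$toy_vcf"
```

If the second count is well above zero for the wild-type file, a site-level comparison will treat those positions as present in the wild-type even though the sample itself is reference there.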
I tried a couple of things that worked for me. I extracted all the individual files from the gvcf using the arguments --exclude-non-variants and --remove-unused-alternates, to ensure that the resulting VCF files had only real variants for that sample (GT = 1). Then, I used the --discordance argument as I intended and as far as I can tell, it worked. I hope this can help someone, and sorry for not posting it sooner.
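For anyone following along, the steps above could be sketched roughly as below. This is not the exact set of commands from this thread: the file names and sample names are placeholders, and it assumes the per-sample files are extracted from the combined, genotyped, filtered VCF with --sample-name. By default the script only prints the gatk commands; set GATK=gatk to actually run them.

```shell
#!/bin/sh
# Sketch only. All paths and sample names are placeholders (assumptions).
# Default: print the commands. Run with GATK=gatk to execute them.
GATK=${GATK:-echo gatk}

genome=reference.fasta              # placeholder reference genome
combined=combined.filtered.vcf.gz   # placeholder combined, genotyped, filtered VCF

# Extract one sample, keeping only sites where that sample is non-reference
# and dropping ALT alleles that the sample does not use.
extract_sample() {  # $1 = sample name, $2 = output VCF
  $GATK SelectVariants -R "$genome" -V "$combined" \
    --sample-name "$1" \
    --exclude-non-variants \
    --remove-unused-alternates \
    -O "$2"
}

# Keep records of the first VCF that are not found in the second.
discordant() {  # $1 = mutant VCF, $2 = wild-type VCF, $3 = output VCF
  $GATK SelectVariants -R "$genome" -V "$1" --discordance "$2" -O "$3"
}

extract_sample wildtype wildtype.variants.vcf.gz
extract_sample mutant1 mutant1.variants.vcf.gz
discordant mutant1.variants.vcf.gz wildtype.variants.vcf.gz mutant1.discordant.vcf.gz
```

The point of the first step is that each per-sample file then lists only sites where that sample has GT = 1, so a comparison on site presence lines up with a comparison on genotype.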
I will try your third solution and see if I get similar results.
Thanks a lot,
Carlos
-
Hi Carlos P Arques,
Thank you for providing your solutions. This will be very helpful for other users as well as the GATK team.
Kind regards,
Pamela