Can CalculateGenotypePosteriors be used on a hard-filtered VCF?
Hi,
I was wondering whether it is possible, and whether it makes sense, to use the CalculateGenotypePosteriors tool on a hard-filtered multi-sample VCF file.
I have a large multi-sample VCF file (>800 individuals) from a non-model organism, so I can't use VQSR. I wanted to apply hard filtering (including on genotype qualities, ahead of a GWAS), but the GQ values are surprisingly low (70% have GQ<20) considering the coverage and the input BAM file statistics.
I was wondering if there could be any bias in GQ estimates and whether using CalculateGenotypePosteriors could therefore improve their accuracy.
Thank you again for the continuous support.
-
Hi Hugo DENIS
Setting the experimental design aside, I want to ask a question about the GQs. How was this VCF generated? Which variant caller was used to call the genotypes?
CalculateGenotypePosteriors requires unbiased genotype likelihoods to be present in the VCF. Does this VCF have them in the form of GL or PL FORMAT fields?
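For context on how the two fields relate: in GATK output, GQ is derived from the PL values as the gap between the most likely genotype (PL normalized to 0) and the second most likely one, capped at 99. The sketch below illustrates that relationship. The function name `gq_from_pl` and the PL triples are hypothetical and are not taken from the VCF discussed in this thread.

```python
def gq_from_pl(pl, cap=99):
    """Return GQ as the gap between the two smallest PL values, capped at `cap`."""
    ordered = sorted(pl)
    return min(ordered[1] - ordered[0], cap)


# Hypothetical PL triples in the order (hom-ref, het, hom-alt).
weak_call = [0, 12, 135]      # few reads: the het genotype is only slightly less likely
strong_call = [0, 120, 1800]  # many reads: the gap exceeds the cap

print(gq_from_pl(weak_call))
print(gq_from_pl(strong_call))
```

A low GQ computed this way simply means the second-best genotype is almost as well supported as the best one, which is expected when few reads cover the site.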
Regards.
-
Hi, Thank you again for your help.
I created the VCF file using the GATK Best Practices pipeline:
1 HaplotypeCaller > (all reference)
2 GenomicsDBImport > Create genomic database (chromosomes only, split into 4 or 2 intervals depending on size)
3 GenotypeGVCFs > (per interval)
4 GatherVcfs > single VCF file with all chromosomes
5 SelectVariants > I selected SNP variants only.
6 VariantFiltration
Yes, the VCF file has the PL FORMAT field, at least for the individuals where it can be computed. See an example below for a random variant (before any filtering was applied).
Kind regards,
-
Hi Hugo DENIS
Looking at the genotype fields for your samples, the GQs appear to be well calibrated for the depth of your samples, so there does not seem to be anything wrong with them. If you still wish to use the tool, here are some of the issues you may face.
1- Since you are working with a non-model organism, you may not have a resource file of common variants for your species. CalculateGenotypePosteriors boosts the GQ of sites that are found in the resource file and reduces the GQ of those that are not.
2- Without a resource file, the tool will work only from the variants found in your VCF. Most likely, variants found in multiple samples will get boosted, and singletons may be reduced to HOMREF. This may still be useful for you: since your aim is a GWAS, you will keep good-quality common sites and probably get rid of many artifactual singletons.
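To make the boosting and singleton behaviour concrete, here is a toy sketch of how a cohort-derived prior can shift a genotype call. It is not the algorithm CalculateGenotypePosteriors implements: it assumes a simple Hardy-Weinberg prior from a single alternate allele frequency, and the function names, PL values, and frequencies are all hypothetical.

```python
import math


def posteriors(pl, alt_freq):
    """Combine PLs with a Hardy-Weinberg prior (an assumption of this sketch)."""
    q = alt_freq
    p = 1.0 - q
    prior = [p * p, 2 * p * q, q * q]            # hom-ref, het, hom-alt
    likelihood = [10 ** (-x / 10) for x in pl]   # PL is Phred-scaled
    unnorm = [l * pr for l, pr in zip(likelihood, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def gq_from_posteriors(post, cap=99):
    """Phred-scaled gap between the best and second-best posterior."""
    pp = sorted(-10 * math.log10(max(x, 1e-300)) for x in post)
    return min(round(pp[1] - pp[0]), cap)


weak_het = [12, 0, 40]  # hypothetical het call with GQ 12 from the PLs alone

# Singleton in a cohort of 800 diploid samples: 1 alt allele out of 1600.
singleton = posteriors(weak_het, alt_freq=1 / 1600)
# Allele seen in many samples.
common = posteriors(weak_het, alt_freq=0.5)

print(singleton.index(max(singleton)), gq_from_posteriors(singleton))
print(common.index(max(common)), gq_from_posteriors(common))
```

In this toy model the weakly supported singleton is pulled to index 0 (hom-ref), while the same likelihoods at a common site keep the het call with a higher GQ than the PLs alone gave.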
I hope this helps.
-
Hi,
Thank you for your answer. It therefore seems that this tool would not suit our target species.
Even if the genotype qualities are well calibrated, I noticed something odd in the VCF file produced by this pipeline: the average site coverage drops from 6.8x in the BAM files (all chromosomal regions) to 5.7x in the VCF file (SNPs only). I am not sure why this is the case.
Would you expect variable regions to have lower depth than non-variable regions because of lower read mapping rates?
If the site depth were higher, the genotype qualities would probably also be higher for many individuals at many sites.
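As a rough check on the depth argument, the sketch below estimates the GQ of a homozygous call when every read supports the same allele. It assumes a simplified model in which each read has probability 0.5 under the het genotype and (1 - error) under the homozygous one. The function name and the 0.001 base error rate are assumptions for illustration, not values from this dataset.

```python
import math


def expected_hom_gq(depth, base_error=0.001, cap=99):
    """Approximate GQ of a homozygous call when all `depth` reads agree."""
    # Phred-scaled likelihood ratio contributed by each read: hom versus het.
    per_read = 10 * math.log10((1 - base_error) / 0.5)
    return min(round(depth * per_read), cap)


for depth in (3, 6, 7, 10):
    print(depth, expected_hom_gq(depth))
```

Under these assumptions each read adds about 3 Phred units, so a homozygous genotype needs roughly 7 supporting reads to reach GQ 20. At an average depth near 6x, a large share of genotypes falling below GQ 20 would therefore not be surprising.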
Thank you