Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Can CalculateGenotypePosteriors be used on hard-filtered VCF

  • Gökalp Çelik

    Hi Hugo DENIS

    Aside from the experimental design part I want to ask a question about GQs. How was this VCF generated? What is the variant caller used to call genotypes? 

    CalculateGenotypePosteriors requires unbiased genotype likelihoods to be present in the VCF. Does this VCF carry them as GL or PL FORMAT fields?
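One quick way to answer this question is to scan the FORMAT column of the VCF records. The following is a minimal Python sketch, assuming a plain-text VCF supplied as an iterable of lines; the function name `count_likelihood_records` is hypothetical and not part of GATK.

```python
def count_likelihood_records(vcf_lines):
    """Count data records whose FORMAT column lists PL or GL.

    Returns a tuple (records_with_likelihoods, total_records).
    """
    with_likelihoods = 0
    total = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        total += 1
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # sites-only record: no FORMAT column at all
        format_keys = columns[8].split(":")
        if "PL" in format_keys or "GL" in format_keys:
            with_likelihoods += 1
    return with_likelihoods, total
```

Passing an open file handle (for example from `open("cohort.vcf")`, where the file name is a placeholder) works because the function only iterates over lines. A result in which the first number is zero would mean no record carries likelihoods.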

    Regards. 

  • Hugo DENIS

    Hi, Thank you again for your help. 

    I created the VCF file using the GATK Best Practices pipeline:

    1 HaplotypeCaller >  (all reference)

    2 GenomicsDBImport > Create genomic database (chromosomes only, split into 4 or 2 intervals depending on size)

    3 GenotypeGVCFs >  (per interval)

    4 GatherVcfs > single VCF file with all chromosomes

    5 SelectVariants > I selected SNP variants only.

    6 VariantFiltration
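For readers who want to see the six steps above as commands, here is a hedged Python sketch that only builds and prints the GATK4 command lines; it does not run them. All file names, the workspace path, the interval, and the `QD < 2.0` filter are illustrative assumptions, and `-ERC GVCF` is assumed because GenomicsDBImport takes GVCF input. The thread does not state the actual arguments used.

```python
import shlex


def pipeline_commands(ref, bam, sample, interval):
    """Build example GATK4 commands for one sample and one interval.

    Every path and the hard-filter expression are placeholders,
    not the settings used in the thread.
    """
    gvcf = f"{sample}.g.vcf.gz"
    workspace = f"genomicsdb_{interval}"
    interval_vcf = f"{interval}.vcf.gz"
    return [
        # 1. Per-sample calling in GVCF mode
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
         "-O", gvcf, "-ERC", "GVCF"],
        # 2. Import GVCFs into a GenomicsDB workspace for one interval
        ["gatk", "GenomicsDBImport", "-V", gvcf,
         "--genomicsdb-workspace-path", workspace, "-L", interval],
        # 3. Joint genotyping for that interval
        ["gatk", "GenotypeGVCFs", "-R", ref,
         "-V", f"gendb://{workspace}", "-O", interval_vcf],
        # 4. Concatenate per-interval VCFs (one -I per interval, in order)
        ["gatk", "GatherVcfs", "-I", interval_vcf, "-O", "all.vcf.gz"],
        # 5. Keep SNPs only
        ["gatk", "SelectVariants", "-V", "all.vcf.gz",
         "--select-type-to-include", "SNP", "-O", "snps.vcf.gz"],
        # 6. Hard filtering (threshold shown is only an example)
        ["gatk", "VariantFiltration", "-V", "snps.vcf.gz",
         "--filter-expression", "QD < 2.0", "--filter-name", "QD2",
         "-O", "snps.filtered.vcf.gz"],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("ref.fasta", "sample1.bam", "sample1", "chr1"):
        print(shlex.join(cmd))
```

In a real cohort, steps 1 and 2 would take one input per sample and steps 2 and 3 would be repeated per interval; the sketch shows a single sample and interval to keep the order of tools visible.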

    Yes, the VCF file has the PL FORMAT field, at least for the individuals where it can be computed. See an example below for a random variant (before any filtering was applied).

    Kind regards, 

  • Gökalp Çelik

    Hi Hugo DENIS

    Checking the genotype fields for your samples, the GQs look quite well calibrated for that particular depth, so there does not seem to be anything wrong with them. If you wish to use the tool, we should point out some issues you may face.

    1- Since you are working with a non-model organism, you may not have a resource file of common variants for your species. CalculateGenotypePosteriors boosts the GQ of sites that are found in the resource file and reduces the GQ of sites that are not.

    2- If you still wish to use the tool, you will be running it with only the variants found in your own VCF. Most likely, variants found in multiple samples will get boosted, and singletons may be reduced to hom-ref genotypes. This may still be useful, since your aim is GWAS: you would keep good-quality common sites and probably get rid of many artifactual singletons.
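The boosting and reduction described above follow from Bayes' rule: a genotype posterior is the likelihood (from PL) multiplied by a prior that depends on how common the allele is. The sketch below is a simplified illustration using a Hardy-Weinberg prior from an assumed alternate-allele frequency; it is not the tool's actual prior model, and the function names and example PL values are hypothetical.

```python
from math import log10


def hardy_weinberg_prior(alt_freq):
    """Prior over (hom-ref, het, hom-alt) for a biallelic site."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2]


def genotype_posteriors(pl, prior):
    """Combine phred-scaled likelihoods (PL) with a prior via Bayes' rule."""
    likelihoods = [10 ** (-value / 10) for value in pl]
    unnormalised = [lik * pri for lik, pri in zip(likelihoods, prior)]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]


def phred_scaled(posteriors):
    """Phred-scale posteriors so that the best genotype gets 0."""
    raw = [-10 * log10(max(p, 1e-300)) for p in posteriors]
    best = min(raw)
    return [round(value - best) for value in raw]


# A weakly supported het call: PL = (hom-ref, het, hom-alt)
weak_het_pl = [10, 0, 40]

# Singleton among 50 diploid samples: 1 alt allele out of 100
singleton = genotype_posteriors(weak_het_pl, hardy_weinberg_prior(0.01))

# Common variant: half of the alleles in the cohort are alt
common = genotype_posteriors(weak_het_pl, hardy_weinberg_prior(0.5))

print("singleton:", phred_scaled(singleton))  # hom-ref becomes the best genotype
print("common:   ", phred_scaled(common))     # het stays best, with a wider margin
```

With the same likelihoods, the rare-allele prior flips the call to hom-ref, while the common-allele prior widens the gap between het and the next-best genotype. That matches the behaviour described for singletons versus variants seen in multiple samples.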

    I hope this helps. 

  • Hugo DENIS

    Hi, 

    Thank you for your answer. It therefore seems that this tool would not suit our target species. 

    Even if genotype qualities are well calibrated, I noticed something odd in the VCF file produced by this pipeline: the average site coverage dropped from 6.8x in the BAM files (all chromosomal regions) to 5.7x in the VCF file (SNPs only). I am not sure why this is the case.

    Would you expect variable regions to have lower depth than non-variable regions because of lower read mapping rates?

    If the site depth were higher, the genotype qualities would probably also be higher for many individuals at many sites.
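The link between depth and genotype quality can be illustrated with a toy diploid model, in which every read at a site supports the reference allele and reads are independent. This is a simplification for intuition only, not HaplotypeCaller's actual likelihood model; the function name, the error rate of 0.001, and the cap of 99 are assumptions of the sketch.

```python
from math import log10


def hom_ref_gq(depth, error_rate=0.001, cap=99):
    """Toy GQ for a site where all `depth` reads match the reference.

    Per-read probability of a reference base under each genotype:
    hom-ref = 1 - error_rate, het = 0.5, hom-alt = error_rate.
    GQ is taken as the second-smallest PL, capped at `cap`.
    """
    log_likelihoods = [
        depth * log10(1 - error_rate),  # hom-ref
        depth * log10(0.5),             # het
        depth * log10(error_rate),      # hom-alt
    ]
    best = max(log_likelihoods)
    pl = sorted(round(-10 * (value - best)) for value in log_likelihoods)
    return min(pl[1], cap)


for depth in (3, 6, 7, 12, 40):
    print(depth, hom_ref_gq(depth))
```

Under these assumptions each additional reference read adds roughly 3 phred units to the GQ until the cap is reached, so a difference of about one read in mean depth shifts the GQ by only a few units.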

    Thank you

