Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Read counts in vcf files do not reflect local realignments


2 comments

  • Tiffany Miller

    Hi jhb, yes, please share your pipeline if you can. I will write back with a more thorough answer once we have it. In the meantime, I wanted to pass along this doc in case it explains what you are seeing with your ADs.

  • Tiffany Miller

    Hi jhb

    Sorry for the delay, but here are responses to your questions:

    1) Is the listing of allele depths from before local realignment a universal feature of the GATK pipeline, or is it something that has happened uniquely in my analysis, or because I mistakenly put in the wrong flag somewhere along the way?

    The allele depths are reported after local realignment. The math is explained in the article I shared above: https://gatk.broadinstitute.org/hc/en-us/articles/360035532252-Allele-Depth-AD-is-lower-than-expected
    In the VCF, a read that is considered uninformative is counted toward the DP but not the AD. Although uninformative reads are not reported in the AD, they are still used in the genotyping calculations. A read that is considered informative is counted toward both the DP and the AD of the allele it supports in the output record.
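
    To make the AD/DP relationship concrete, here is a toy Python sketch. It is not GATK's implementation: it simply assumes each read has already been labelled with the allele it supports, or None if it is uninformative, and the function name and example reads are made up for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def tally_depths(
    read_alleles: Sequence[Optional[str]], alleles: Sequence[str]
) -> Tuple[List[int], int]:
    """Toy model: return (AD, DP) for one sample at one site.

    read_alleles holds one entry per read: the allele the read supports,
    or None when the read is uninformative (an assumed labelling, not
    something GATK exposes in this form).
    """
    # Only informative reads contribute to the per-allele depths (AD).
    counts = Counter(a for a in read_alleles if a is not None)
    ad = [counts[a] for a in alleles]
    # Every read, informative or not, contributes to the depth (DP).
    dp = len(read_alleles)
    return ad, dp


# Made-up example: 6 reads support A, 3 support G, 2 are uninformative.
reads = ["A"] * 6 + ["G"] * 3 + [None] * 2
ad, dp = tally_depths(reads, ["A", "G"])
print(ad, dp)
```

    In this model the AD values can sum to less than the DP, which is the pattern the linked article describes.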


    2) If this is unique to me, how can I get GATK to emit post-local-realignment allele depths?
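    N/A. This is not unique to your analysis: the allele depths GATK emits are already post-local-realignment.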

    3) Ideally, VCFs should include post-local-realignment allele depths. If that is not possible for some reason, they should include no allele depths at all. The ones that are provided are worse than useless, because they are not describing the same loci as the genotypes! Anyone who wants to do post-GATK filtering like eliminating loci with ultra-high read depth (a pretty common filtering step) is unable to do so, but probably thinks that he or she can. This is a problem.
     
    As mentioned above, all the post-local-realignment reads are used for calculating genotypes, so the depths do in fact describe the same loci as the genotypes. Filtering on ultra-high read depth does not depend on alleles, so you could also use the site-level DP, which does not depend on any read likelihoods.
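
    As a rough illustration of filtering on site-level DP, the sketch below reads the DP entry from the INFO column of a VCF data line using plain Python. The helper names, the example records, and the cap of 200 are all hypothetical; pick a threshold that suits your own coverage.

```python
from typing import Optional


def site_depth(vcf_line: str) -> Optional[int]:
    """Return the site-level DP from the INFO column, or None if absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th tab-separated VCF column
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP" and value:
            return int(value)
    return None


def passes_depth_cap(vcf_line: str, max_depth: int) -> bool:
    """Keep a site only if its INFO DP is present and not above the cap."""
    dp = site_depth(vcf_line)
    return dp is not None and dp <= max_depth


# Made-up records for illustration only.
normal = "chr1\t1000\t.\tA\tG\t50\t.\tAC=1;DP=35"
too_deep = "chr1\t2000\t.\tC\tT\t50\t.\tAC=1;DP=900"
MAX_DEPTH = 200  # hypothetical cap, not a recommended value

kept = [rec for rec in (normal, too_deep) if passes_depth_cap(rec, MAX_DEPTH)]
print(len(kept))
```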

    4) If this is a general feature of GATK, are downstream GATK tools being tricked by this? It is possible to do variant filtration based on “AD” and “DP”, so I suspect that they are, at least in those cases.
     
    The gnomAD team has successfully created a high-quality callset by filtering on AD and DP as follows: high-quality variants have depth >= 10, genotype quality >= 20 and minor allele balance > 0.2 for heterozygous genotypes -- see "AC_adj" in https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/
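
    For reference, the three thresholds quoted above could be written as a per-genotype check along these lines. This is only a sketch, not gnomAD's code: the function name is made up, it assumes a biallelic site, and it assumes "minor allele balance" means the smaller of the two allele depths divided by their sum.

```python
from typing import Tuple


def is_high_quality(dp: int, gq: int, ad: Tuple[int, int], is_het: bool) -> bool:
    """Apply depth >= 10, GQ >= 20, and (hets only) allele balance > 0.2.

    ad is (ref_depth, alt_depth) for a biallelic site. Allele balance is
    taken here as min(ad) / sum(ad), which is an assumption about the
    quoted criterion rather than a confirmed definition.
    """
    if dp < 10 or gq < 20:
        return False
    if is_het:
        total = ad[0] + ad[1]
        if total == 0:
            return False  # no informative reads, so balance is undefined
        return min(ad) / total > 0.2
    return True


# Made-up genotypes for illustration.
print(is_high_quality(30, 99, (14, 16), is_het=True))   # balanced het
print(is_high_quality(30, 99, (27, 3), is_het=True))    # skewed het
print(is_high_quality(12, 40, (0, 12), is_het=False))   # hom-alt
```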