Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

AD information conflicted with input bams for HaplotypeCaller

6 comments

  • danilovkiri

    Hi Sin Lee

    Comparing GATK results directly against the original BAM supplied to HaplotypeCaller (HC) can be misleading. Check out https://gatk.broadinstitute.org/hc/en-us/community/posts/360068136032-Multiple-cases-where-GATK4-is-not-giving-correct-variant-calls and https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant

    The most probable cause is local realignment: HC realigns the reads within an active region, that is, a region where variation is observed. To get a BAM file with the realigned reads and the assembled haplotypes, use the --bamout option. Please read the GATK HC documentation for reference.

    PS: I guess you should have named your question in a different way since it has nothing to do with CombineGVCFs.
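
To look at the realigned reads around one site, HaplotypeCaller can be re-run on a small interval with a bamout file requested. The following is a minimal Python sketch that only assembles such a command line. The file names and the interval are hypothetical placeholders, and `--bam-output` is assumed to be the long form of the bamout option in GATK4, so it is worth checking against the documentation of the installed version.

```python
import shlex


def haplotypecaller_bamout_cmd(reference, input_bam, output_vcf, bamout, interval):
    """Build the argument list for a HaplotypeCaller run that also writes a bamout file."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,         # reference FASTA
        "-I", input_bam,         # original input BAM
        "-O", output_vcf,        # variant calls
        "-L", interval,          # restrict the run to the region of interest
        "--bam-output", bamout,  # realigned reads and assembled haplotypes
    ]


# All paths and the interval below are made-up examples.
cmd = haplotypecaller_bamout_cmd(
    "ref.fasta",
    "sample.bam",
    "sample.region.vcf.gz",
    "sample.region.bamout.bam",
    "chr1:1000-2000",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.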

  • Sin Lee

    @danilovkiri 

    Thanks for your quick reply. I've checked the realigned BAM file generated by HaplotypeCaller, and the AD information does agree with it.

    What still confuses me is that after realignment this site became tri-allelic, with AD 6, 19 and 10 for the ref (T), alt 1 (G) and alt 2 (*) alleles respectively. That makes it hard to filter any allele by minor allele frequency or allele count, and such tri-allelic sites seem abnormal.

    I would appreciate any suggestions on this.

    Best regards,

    Sin Lee
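
The per-allele fractions behind the AD values quoted above (6, 19 and 10 for T, G and *) are simple arithmetic. A minimal Python sketch, assuming the AD list is ordered REF first and then the ALT alleles as they appear in the VCF; the helper name `allele_fractions` is hypothetical.

```python
def allele_fractions(ad):
    """Return each allele's share of the total allelic depth."""
    total = sum(ad)
    if total == 0:
        raise ValueError("AD sums to zero; allele fractions are undefined")
    return [count / total for count in ad]


alleles = ["T", "G", "*"]  # REF, ALT 1, ALT 2 from the thread
ad = [6, 19, 10]           # AD values from the thread

for allele, count, fraction in zip(alleles, ad, allele_fractions(ad)):
    print(f"{allele}\tAD={count}\tfraction={fraction:.3f}")
```

With a total depth of 35, the REF allele T accounts for 6/35 of the informative reads, G for 19/35 and the spanning deletion for 10/35.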

  • danilovkiri

    Please, read the latest comment at https://gatk.broadinstitute.org/hc/en-us/community/posts/360067451372-Haplotype-caller-missed-variant?page=1#community_comment_360010873331

    It explains how to sort and colour the bamout BAM file to get an idea of what happens there. The bamout file contains all the reads from the original BAM for all active regions. However, it also contains the assembled haplotypes represented as reads, along with the realigned reads that support them. It is not clear from your screenshot what these reads are and which haplotype they relate to.

    As for the asterisk in the ALT field of the VCF, I hope you know it is reserved to indicate that the allele is missing due to an overlapping deletion, which you can see in your bamout. Why do you suppose it to be abnormal?

  • Sin Lee

    danilovkiri

    Thanks for your suggestions. I carefully read the earlier post describing a similar situation, then sorted and colored the bamout file as shown below.

    I understand that '*' is reserved to indicate that the allele is missing due to an upstream deletion, but a tri-allelic site in a single diploid sample still looks odd to me. It would be convenient to keep the two alleles with the higher depth or allele frequency, but the third allele is hard to ignore when its allele frequency is about 0.17. Such a high percentage (about 0.005) of '[ATCG]/N' genotypes in the population also seems unusual.

    I fully understand that it's not easy to deal with such a poly-A structure. Thanks for your time.

    Best regards.

    Sen Li

  • danilovkiri

    Well, look. There is a reference allele (T), which is always present in a VCF record, and 2 ALT alleles (G and a spanning deletion). The genotype (GT) for your sample is 1/2 (assuming a diploid sample), so two alleles are present at this position in the genome: G and DEL. I don't see any problem here. The reference allele T has index `0`, which does not appear in the GT FORMAT field, so it is apparently not present in the genome.

    Of course, this is a probabilistic approach, and you can see that the AD for the REF allele T is nonzero. Under the Bayesian model, though, the probability that a REF allele is truly present in the derived genotype is close to zero. GATK assumes a ploidy of 2 by default, so you may observe tetra-allelic sites (and more), but the final genotype will include at most 2 alleles. The larger the number of samples, the more ALT alleles you may find in the ALT field, since it accumulates the ALT alleles of all samples in the VCF. For a single sample, it is OK to observe a tri-allelic site when the GT is 1/2.
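
The way GT indices map onto the REF and ALT columns can be sketched in a few lines of Python. This is an illustration only, using the REF, ALT and GT values from the thread; the helper name `genotype_alleles` is hypothetical.

```python
import re


def genotype_alleles(ref, alt, gt):
    """Map GT indices to allele strings: index 0 is REF, 1..n are the ALT alleles in order."""
    alleles = [ref] + alt.split(",")
    called = []
    for index in re.split(r"[/|]", gt):  # "/" is unphased, "|" is phased
        called.append(None if index == "." else alleles[int(index)])
    return called


called = genotype_alleles("T", "G,*", "1/2")
print(called)  # ['G', '*']
```

The site lists three alleles (T, G and *), but the diploid genotype 1/2 refers only to G and the spanning deletion, so the REF allele T is not part of the call.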

  • Sin Lee

    danilovkiri

    Thanks for your quick reply and patience! It's clear to me now.

    Best regards,

    Sin
