Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Haplotype caller missed variant

0

12 comments

  • Avatar
    Bhanu Gandham

    Hi,

     

    Looks like this discussion is already started in this thread: https://gatk.broadinstitute.org/hc/en-us/community/posts/360067437992-Haplotypecaller-missed-variants-in-HLA-genes

    We suggest that all discussions about a single topic be in the same thread. This helps us get to these queries faster. Let's continue our discussions in that thread.

    0
    Comment actions Permalink
  • Avatar
    ensel

    Hi, Bhanu

    This post is different from the  https://gatk.broadinstitute.org/hc/en-us/community/posts/360067437992-Haplotypecaller-missed-variants-in-HLA-genes.
    Previous post is about missing the calls with low VAF, and this post is about the variants located in the end of a haplotype.

    In the figure, the called & missed variants have similar VAF, but the variant in the end of the haplotype was missed while the others were called.

    The missed variant was missed again with higher sensitivity to low VAF by setting ploidy = 3.

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi ensel

    Was the BAM file from above generated via `-bamout` or is it the source BAM? If from -bamout, please upload it somewhere, I will try to have a look at it.

    0
    Comment actions Permalink
  • Avatar
    ensel

    Hi Danilovkiri,

     

    The BAM file from -bamout was uploaded in the ftp server.

    NGB_HLA_DQB1-2.tar.gz

    There are the haplotypes bearing the variant, but the vcf file missed the variant.

    Thank you for your helps!!! 

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    I'm sorry but I do not understand which ftp server you are talking about. At least the ftp link to the file would help.

    0
    Comment actions Permalink
  • Avatar
    ensel

    Sorry for no information about the ftp server.

    The ftp server for bug report.

    location: ftp.broadinstitute.org
    username: gsapubftp
    password: 5WvQWSfi

     

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi ensel

     

    The ftp site you are referring to is only available internally to the Broad Institute. To share with danilovkiri I suggest you put the file in a google bucket and share the bucket address.

     

    0
    Comment actions Permalink
  • Avatar
    ensel

    Hi Danilovkiri,

     

    You can download the files from

    NGB_HLA_DQB1-2.tar.gz

     

    IP :61.74.252.40    port : 16021

    ID : openftp    pw: ftp1234

    0
    Comment actions Permalink
  • Avatar
    ensel

    Hi Danilovkiri,

     

    I found another data where Haplotypecaller missed the variant in the end of a haplotype.

    corresponding bam files were uploaded in the ftp server.

    Please refer to the bed file to find the position of missed variants.

    file : NGB_HLA.tar.gz

    sample_1 : IHW09247 

    sample_2 : CRS-123

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    ensel I can't find the variations you are talking about. When attaching a screenshot, please include the track with positions to know exactly where to look.

    0
    Comment actions Permalink
  • Avatar
    ensel

    Hello, Danilovkiri

    When using IGV

    load Genome from file 'HLA-total.refgene.fasta'

    The positions of the missed variants are

    sample 1 : HLA-DRB1 : 6,472 C>A

    sample 2 : HLA-DQB1 : 6,024  G>A

    Thank you !!!

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi ensel

    Sure thing I loaded the reference genome. It's just difficult to look for variations based on solely the depth of coverage profile available on your screenshots.

    I had a look at your BAM files. First, I hope you are familiar with HC mechanics, i.e. it defines active regions, creates artificial haplotypes, maps all reads against all artificial haplotypes, defines which read best maps which artificial haplotype. In the end, you have a bunch of artificial haplotypes and supporting reads for them.

    The --bamout BAM file contains both artificial haplotypes (represented as reads with specific tag values: Sample = HC, Read Group = ArtificalHaplotypeRG-NNNNNN..., HC = integer) and reads (Sample = Reads, Read Group = reads-NNNNNNN...). Reads which do have support for either of artificial haplotypes have an HC tag with exactly the same value as the artificial haplotype they are supporting.

    To better understand what happens in the BAM file, I suggest you:

    1. Sort alignments by -> tag (user input: "HC")

    2. Colour alignments by -> read group

    3. Group alignment by -> tag (user input: "HC")

    This will enable you to see grouped blocks of alignments: each block will contain an artificial haplotype (colour 1) and supporting reads (colour 2) IF PRESENT). More info on definitions of supporting reads and etc can be found at the HC Documentation

    For HLA-DRB1:6472 C>A I see the following: an artificial haplotype with HC=-2054697616 and a number of supporting reads covers this position. No C>A variation is present in this haplotype. Other artificial haplotypes (which have C>A variation at position 6472) have a worse mapping (more variations, numerous gaps) and often do not have supporting reads. Eventually, when HC uses Pair-HMM to calculate probabilities of variant sites give the supporting reads data, it selects the most probable sites. In your example I can see that the evidence for variation in HLA-DRB1:6472 is minor. Hence the variation was not called since the selected artificial haplotypes did not contain that variation.

    The same applies to HLA-DQB1:6024 G>A (see the screenshots). 

    However, since you are using PCR amplicons for alignment, I can say that this is definitely not the best input data for HC and I suppose it was not tested on such data. I definitely advise you to look at the documentation (specifically all the parameters which set various thresholds). HC might just filter some reads, not take them into account/whatever. Base quality is also important since HC has a threshold for it (btw, how did you set the base quality for PCR data?). It might be useful to adjust these parameters (probably lower the default thresholds) and see what happens. Of course, it is useful to do only in case you have prior knowledge of these two variations we are talking about. If you don't, I would not be so sure to state that HC missed variants.

    Good luck!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk