Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Mutect2 - sum of FAD greater than sum of AD

0

6 comments

  • Avatar
    SkyWarrior

    Hi Gero Doose

    Looking at the header of the vcf file created by Mutect2 FAD does not seem to be a measure of filtered allelic depths

    ##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">

    Change in numbers for fragments vs allelic depths could be due to the local reassembly and pairHMM. 

    https://gatk.broadinstitute.org/hc/en-us/articles/360035532252-Allele-Depth-AD-is-lower-than-expected 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Gero Doose

    Hi SkyWarrior,

    Thank you for your explanation. Indeed, FAD refers to the fragment counts while AD refers to read counts. The case where I would expect those values to be different is on overlapping reads in paired-end sequencing. And I would expect hat the count of fragments is less or equal to the number of reads.

    Following your pointers, I find in the "Notes to Mutect2" document that "This capping of overlapping base qualities occurs before Pair-HMM, while merging reads into fragments occurs after Pair-HMM." So while it might be related to Pair-HMM, I still do not understand how the FAD can be bigger than the AD.

    Regards.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    @Gero Doose This is expected.  Supposing that read pairs don't overlap one would expect AD and FAD to match exactly.  However, only informative reads / fragments are counted, and it is occasionally possible that an individual read is too ambiguous as to which allele it supports while the fragment is not ambiguous.

    For example, suppose a read ends in the middle of an STR and we don't know how many repeats it supports, but its mate contains a SNV that is phased with a particular number of repeats.  Then the fragment, but alone the read alone, is informative.

    0
    Comment actions Permalink
  • Avatar
    Gero Doose

    Hi David Benjamin,

    Thank you very much for providing a possible explanation! If I understand correctly, this occurs when overlapping read pairs are merged into one more informative long read (representing the complete fragment). 

    In my data I see other cases, where there are no overlapping read pairs. What could be the explanation here? 

    Here is one such example: We have a potential G-C single-nucleotide variant with the Mutect2 VCF output reporting the allelic depths AD=1,7 and FAD=3,10. The image shows the original BWA-MEM alignments in IGV (reads shown as pairs), and one can see that originally 7 reads support the C and there is one supporting T, resulting in a total depth of 8. The sum of the FAD is 13, and since there is no overlap between the read pairs, I wonder how they are obtained.

    Would it be possible to see the additional fragments leading to the higher FAD values somewhere? I looked into the Mutect2 bamout, but I could not find additional reads at this position (beside the ArtificialHaplotypeRGs).

    Kind regards

     

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    Gero Doose Actually, my point above does not rely on the overlap of read pairs.  It's not that Mutect2 needs mates to overlap in order to create a merged pseudo-read that is more informative than either.  Rather, even without overlap the alignment of the mate (for clarity let's call this read 2) can determine which locally-assembled haplotype the pair as a whole comes from, and the information from that haplotype then may resolve ambiguity in the vicinity of read 1.

    Concretely, suppose we have two variant loci each with one alt allele.  The first loci is an AAAAAAAA homopolymer with an insertion of an extra A, and the second is a T->G SNP.  Now suppose read 1 ends in the middle of the homopolymer and is therefore, by itself, totally uninformative.  But suppose that our assembly reveals that the insertion allele is in phase with the G SNP allele.  Then if read 2 exhibits the G allele at the SNP locus the fragment as a whole is unambiguously informative of both the SNP and the extra A insertion at the homopolymer!

    Crucially, this is not just an abstract point but actually reflects how Mutect2 performs somatic genotyping.  It knows about read pairs internally.

    I don't guarantee that your scenario is quite as straightforward as the toy example I gave.  In general, for example, the GATK assembles more haplotypes than it ought to except when using --linked-de-bruijn-graph mode, which is the wonderful work of my colleagues Kiran Garimella and James Emery.  However, the flow of information from the mate to the haplotype to the original read is the gist.

    0
    Comment actions Permalink
  • Avatar
    Gero Doose

    Hi David Benjamin,

    Thank you for providing such a thorough explanation! Your response has significantly clarified my understanding of how even non-overlapping read pairs can impact somatic genotyping through haplotype assembly, potentially leading to instances where FAD values exceed AD values. I truly appreciate the time you took to share your expertise on this matter. 

    Best regards

    Gero Doose

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk