Mutect2 - sum of FAD greater than sum of AD
Dear community,
In a VCF file generated by the GATK Best Practices MuTect2 pipeline, I've observed 681 out of 27084 lines where the sum of the Filtered Allelic Depths (FAD) is greater than the sum of the Allelic Depths (AD). (Additionally, in some cases, the Depth of Coverage (DP) values differ between the INFO and FORMAT/SAMPLE sections and is larger than the sum of AD values.) Is this behavior expected or indicative of an issue with the variant calling process? What factors could contribute to these discrepancies?
example line:
21 45970759 . ACTTGCAGCAGACAGG A . . AS_SB_TABLE=67,69|40,39;DP=277;ECNT=3;MBQ=32,32;MFRL=197,198;MMQ=60,60;MPOS=20;POPAF=7.30;TLOD=290.17 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:136,79:0.368:215:43,26:41,33:144,78:67,69,40,39
The VCF is from the “Somatic-SNV-Indels-GATK4” Best Practices pipeline from the terra.bio cloud platform and can be found here:
In addition we also see this phenomenon in our own pipeline using alignments generated with BWA and VCF generated with Mutect2 (Java version: 21.0.1, GATK version: 4.4.0.0).
Thank you in advance for your help.
-
Hi Gero Doose
Looking at the header of the vcf file created by Mutect2 FAD does not seem to be a measure of filtered allelic depths
##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">
Change in numbers for fragments vs allelic depths could be due to the local reassembly and pairHMM.
Regards.
-
Hi SkyWarrior,
Thank you for your explanation. Indeed, FAD refers to the fragment counts while AD refers to read counts. The case where I would expect those values to be different is on overlapping reads in paired-end sequencing. And I would expect hat the count of fragments is less or equal to the number of reads.
Following your pointers, I find in the "Notes to Mutect2" document that "This capping of overlapping base qualities occurs before Pair-HMM, while merging reads into fragments occurs after Pair-HMM." So while it might be related to Pair-HMM, I still do not understand how the FAD can be bigger than the AD.Regards.
-
@Gero Doose This is expected. Supposing that read pairs don't overlap one would expect AD and FAD to match exactly. However, only informative reads / fragments are counted, and it is occasionally possible that an individual read is too ambiguous as to which allele it supports while the fragment is not ambiguous.
For example, suppose a read ends in the middle of an STR and we don't know how many repeats it supports, but its mate contains a SNV that is phased with a particular number of repeats. Then the fragment, but alone the read alone, is informative. -
Hi David Benjamin,
Thank you very much for providing a possible explanation! If I understand correctly, this occurs when overlapping read pairs are merged into one more informative long read (representing the complete fragment).
In my data I see other cases, where there are no overlapping read pairs. What could be the explanation here?
Here is one such example: We have a potential G-C single-nucleotide variant with the Mutect2 VCF output reporting the allelic depths AD=1,7 and FAD=3,10. The image shows the original BWA-MEM alignments in IGV (reads shown as pairs), and one can see that originally 7 reads support the C and there is one supporting T, resulting in a total depth of 8. The sum of the FAD is 13, and since there is no overlap between the read pairs, I wonder how they are obtained.
Would it be possible to see the additional fragments leading to the higher FAD values somewhere? I looked into the Mutect2 bamout, but I could not find additional reads at this position (beside the ArtificialHaplotypeRGs).
Kind regards
-
Gero Doose Actually, my point above does not rely on the overlap of read pairs. It's not that Mutect2 needs mates to overlap in order to create a merged pseudo-read that is more informative than either. Rather, even without overlap the alignment of the mate (for clarity let's call this read 2) can determine which locally-assembled haplotype the pair as a whole comes from, and the information from that haplotype then may resolve ambiguity in the vicinity of read 1.
Concretely, suppose we have two variant loci each with one alt allele. The first loci is an AAAAAAAA homopolymer with an insertion of an extra A, and the second is a T->G SNP. Now suppose read 1 ends in the middle of the homopolymer and is therefore, by itself, totally uninformative. But suppose that our assembly reveals that the insertion allele is in phase with the G SNP allele. Then if read 2 exhibits the G allele at the SNP locus the fragment as a whole is unambiguously informative of both the SNP and the extra A insertion at the homopolymer!
Crucially, this is not just an abstract point but actually reflects how Mutect2 performs somatic genotyping. It knows about read pairs internally.
I don't guarantee that your scenario is quite as straightforward as the toy example I gave. In general, for example, the GATK assembles more haplotypes than it ought to except when using --linked-de-bruijn-graph mode, which is the wonderful work of my colleagues Kiran Garimella and James Emery. However, the flow of information from the mate to the haplotype to the original read is the gist. -
Hi David Benjamin,
Thank you for providing such a thorough explanation! Your response has significantly clarified my understanding of how even non-overlapping read pairs can impact somatic genotyping through haplotype assembly, potentially leading to instances where FAD values exceed AD values. I truly appreciate the time you took to share your expertise on this matter.
Best regards
Gero Doose
Please sign in to leave a comment.
6 comments