(GATK v4.1.6.0) Mutect2 bamout reports fewer reads than VCF AD
I've read many posts regarding differences in VCF allele depth and original (and bamout) allele depths; however, these discussions seem to be mostly about the read filtering that Mutect performs in the VCF AD reporting (i.e. the ADs in VCF are typically expected to be lower than those bamout). However...
What I'm seeing is that the VCF is reporting a higher (FORMAT field) AD and RD than I am seeing in the bamout files. The differences are very small, but I wanted to get to the heart of the issue. I thought the bamouts were supposed to contain the reads that passed all of the filters exactly (i.e. if the reads were reported in the VCF AD fields, then they passed all of the filters, and they would propagate through to the bamout)?
Here is an example an affected read:
chr1 22838258 . G GAATGGAAT . PASS CONTQ=93;DP=1047;ECNT=1;GERMQ=93;MBQ=39,20;MFRL=237,115;MMQ=60,60;MPOS=17;NALOD=1.67;NLOD=40.49;POPAF=6;ROQ=93;SEQQ=82;STRANDQ=93;TLOD=10.21 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:171,0:0.007178:171:83,0:85,0:105,66,0,0 0/1:847,5:0.004191:852:392,4:432,0:594,253,3,2
Here are the IGV coverage boxes for the position and position+1 (to show the insert coverage metrics):
I have also confirmed that there are only 4 reads with the insertions and 845 reads with reference in the bamout using both samtools tview and bam-readcount, respectively.
I have disabled downsampling as well, so I don't think there should be problem with downsampling of reads. But, I'm not entirely sure how the bamouts are generated in full.
Here is the full Mutect2 call (leaving bash variable names in place) for for info:
${RUN_GATK} Mutect2 \
-R ${REF_FILE} \
-O ${OUTPUT_VCF} \
-I ${TUMOR_PREPROCESSED_BAM} \
-tumor=${TUMOR_FILE_PREFIX} \
-I ${NORM_PREPROCESSED_BAM} \
-normal=${NORM_FILE_PREFIX} \
-L ${INTERVALS} \
--max-reads-per-alignment-start 0 \
--max-mnp-distance 0 \
--bamout ${OUTPUT_BAMOUT} \
--germline-resource ${GNOMAD} \
--panel-of-normals ${PANEL_OF_NORMALS} \
--f1r2-tar-gz ${OUTPUT_F1F2}
This discrepancy exists using Mutect v4.1.4.1 as well
-
- So the VCF says there is one more read supporting than the bamout. Looking at the code, one of the transformations in the AD annotation code is subsetting the likelihoods matrix to just the emitted alleles. It’s possible that in the bamout the 5th read is being aligned to a different allele that didn’t pass the emission threshold, while in the AD it’s assigned to the insertion since that allele was removed from consideration.
- 5 out of 847 reads for a big indel like that is enough to call. The likelihoods model (as opposed to the filters) only considers sequencing error, and the chance of 5 reads independently having the same big insertion in Illumina sequencing is negligible.
Please sign in to leave a comment.
1 comment