Mutect2: Incorrectly computed allelic depths and likely false-positive calls
I am running Mutect2 on Illumina amplicon data and observe that the computed allelic depths don't correspond to what I see in IGV or via samtools mpileup. Importantly, this seems to strongly affect the underlying likelihood calculations.
For example, Mutect2 reports:
contig1 932 . A G . . DP=552;ECNT=2;MBQ=37,38;MFRL=480,480;MMQ=60,60;MPOS=23;NALOD=2.27;NLOD=54.79;POPAF=6.00;TLOD=3.39 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:186,0:5.343e-03:186:85,0:91,0:25,161,0,0 0/1:350,3:0.012:353:151,1:179,2:55,295,0,3
IGV gives me for the tumor BAM:
contig1:932
A : 521 (99%, 214+, 307- )
C : 1 (0%, 1+, 0- )
G : 4 (1%, 1+, 3- )
T : 0
N : 0
... and for the normal BAM:
contig1:932
A : 277 (99%, 106+, 171- )
C : 0
G : 2 (1%, 0+, 2- )
T : 0
N : 0
So a) it seems that reported allelic depths are off, b) the "lost alleles" seem to include the 'G' alleles present in the normal BAM, which makes the 'G' alleles in the tumor BAM seem tumor-specific.
Any advice? I am using --max-reads-per-alignment-start 0, only a small number of reads is filtered out, and mapping and base qualities are very high in the region.
Filter log:
Mutect2 - 6 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappingQualityNotZeroReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonChimericOriginalAlignmentReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: ReadLengthReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
6 total reads filtered
Full command:
/gpfs/project/dilthey/software/gatk-github/gatk --java-options "-Xmx4g" Mutect2 -R ref.fasta -I rg_ref_590046.bam --tumor-sample tumor_590046 -I rg_ref_595665.bam --normal-sample healthy_595665 -O test.vcf --max-reads-per-alignment-start 0
Version:
The Genome Analysis Toolkit (GATK) v4.1.4.1-71-gbd68edb-SNAPSHOT
HTSJDK Version: 2.21.1
Picard Version: 2.21.7
-
Thanks for posting. I am looking into this now.
-
Can you take a look at Mutect2's bamout for comparison? What do those results look like?
-
Mutect2 emits unfiltered calls. Given the low TLOD, this is very unlikely to pass FilterMutectCalls.
-
Thank you very much for looking into this!
This is what the .bamout says:
contig1:932
Total count: 564
A : 559 (99%, 80+, 479- )
C : 0
G : 5 (1%, 0+, 5- )
T : 0
N : 0
The variant survives filtering with,/gpfs/project/dilthey/software/gatk-github/gatk --java-options "-Xmx4g" FilterMutectCalls -R ref.fasta -V test.vcf -O test.filtered.vcf
leading to:
contig1 932 . A G . PASS CONTQ=93;DP=552;ECNT=2;GERMQ=93;MBQ=37,38;MFRL=480,480;MMQ=60,60;MPOS=23;NALOD=2.27;NLOD=54.79;POPAF=6.00;SEQQ=15;STRANDQ=18;TLOD=3.39 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:186,0:5.343e-03:186:85,0:91,0:25,161,0,0 0/1:350,3:0.012:353:151,1:179,2:55,295,0,3
-
FilterMutectCalls learns features from all variants simultaneously. Could you try running it on your entire callset and check what result it gives for this site? If you only run it on one site, the tool 1) sees that it's questionable (hence the lowish SEQQ of 15) but retains it because it's desperate not to drop the only possible variant and end up with zero sensitivity and 2) learns an outlandishly large prior of somatic variation because the ratio of possible variants to covered territory is large. [Italicizing to draw the attention of anyone who might read this thread:] It is always best to run FilterMutectCalls on the Mutect2 calls from your entire bam.
To analyze the bamout, let's start with a few screenshots of reads that support the G allele in the original bam, but we'll need to see those reads in the original bam and the bamout (you can search by read name in IGV to highlight a read). Zooming out to roughly a ~100 base window should be about right. It is especially important to see this for the 2 reads with the G in the original normal sample. And it is critical to get the reference sequence around the site in the screenshot, too.
-
Thank you! I have run the FilterMutectCalls command on all HLA-B reads present in a given tumor/normal pair. I could combine this with BAMs representing other HLA genes. I note that, due to the diversity of the HLA, it seems to call quite a lot of variation in general - perhaps the properties of the HLA region mess with the filtering algorithm.
I attach screenshots that show the G-carrying reads in the original BAM. Interestingly, they are not present in the bamout.
-
Oh boy, HLA. You are brave.
About the variant in question, I may agree with Mutect2 for dropping the normal G reads. There's a lot of other stuff in those reads that doesn't show up elsewhere, and a mapping artifact seems a likelier explanation than just a lot of sequencing errors. That doesn't necessarily mean that Mutect2 is right about the tumor reads however, so more screenshots, this time of a couple tumor G reads in the bamout and original bam, would help further.
Probably the best practice for HLA if you have the energy is to do a de novo assembly of the normal HLA and then realign reads to that reference (or, I suppose, two references, since we're diploid) in order to reduce mapping error. The sheer number of competing HLA contigs in hg38 makes aligning to it very difficult.
Assuming you don't want to go down that rabbit hole, your goal here is to reduce mapping errors somehow. You will definitely want to run Mutect2 (and re-run FilterMutectCalls) with a panel of normals, which we keep in a public google bucket at gs://gatk-best-practices/
somatic-hg38/1000g_pon.hg38. vcf.gz (and gs://gatk-best-practices/ somatic-hg38/1000g_pon.hg38. vcf.gz.tbi for the index). I hope that this panel includes the HLA. Running FilterAlignmentArtifacts may also help reduce false positives. It's also worth ruling out contamination, so it would help to run CalculateContamination and feed that into FilterMutectCalls. In the HLA, something that looks like a mapping error may be a correctly-mapped contaminating read from someone with a different HLA haplotype.
I tentatively think that combining calls from different HLA genes is a good idea. Just be sure to either call all the bams in a single invocation of Mutect2 or to call them separately and then run MergeMutectStats on the .vcf.stats files produced by Mutect2 before filtering.
-
Hi David,
First of all, I made a mistake (IGV downsampling turned on). It turns out that one of the 'normal' reads is actually in the bamout - screenshot:
Second, here are screenshots showing three 'tumor' reads, two of which are present in the bamout (for the other read, only its mate pair is present in the bamout):
Yes, the process we currently carry out is similar in some respects to what you propose - so we're using personalized references to determine which HLA gene a read belongs to (the patient HLA types are known), and then we align all reads assigned to gene X to the chromosome 6 version of gene X (to maintain high mapping qualities, which might otherwise suffer due to sequence homologies between the two patient alleles for X). The BAMs that the examples I have shown are taken from has been created using this approach.
Yes, I believe that there may be a level of residual mis-alignment - but looking at the specific reads I have taken screenshots from, it's not clear to me that these reads are actually affected by this.
I have to admit I don't fully understand the content of bamout - so it seems that some form of downsampling has been carried out in determining which reads end up in this file, but it's not clear to me which criteria have been applied. Are there any parameters we could try to set to increase the set of reads taken into account during this step?
Thank you very much for your help!
-
Alexander Dilthey There aren't good parameters to control the bamout, but I have a couple of guesses that might help here. First, could you try adding `-linked-de-bruijn-graph` to your Mutect2 command? Second, could you try `-padding-around-snps 75`? It can't hurt to try them together, too. By the way, is your normal an amplicon sample, too?
-
Hi David,
Yes, the normal is based on exactly the same amplicon / PCR protocol.
Results, focusing on position 932:
Baseline (this and all further results using default arguments, apart from --max-reads-per-alignment-start 0)
HLA-B*07_02_01_01 932 . A G . . DP=552;ECNT=2;MBQ=37,38;MFRL=480,480;MMQ=60,60;MPOS=23;NALOD=2.27;NLOD=54.79;POPAF=6.00;TLOD=3.39 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:186,0:5.343e-03:186:85,0:91,0:25,161,0,0 0/1:350,3:0.012:353:151,1:179,2:55,295,0,3
-linked-de-bruijn-graph -padding-around-snps 75
HLA-B*07_02_01_01 932 . A G . . DP=553;ECNT=1;MBQ=37,37;MFRL=480,492;MMQ=60,60;MPOS=33;NALOD=2.26;NLOD=54.00;POPAF=6.00;TLOD=3.40 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:186,1:5.468e-03:187:85,0:91,0:25,161,0,1 0/1:349,3:0.012:352:151,1:179,2:55,294,0,3
-padding-around-snps 75
HLA-B*07_02_01_01 932 . A G . . DP=552;ECNT=2;MBQ=37,38;MFRL=480,480;MMQ=60,60;MPOS=23;NALOD=2.27;NLOD=54.79;POPAF=6.00;TLOD=3.39 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:186,0:5.343e-03:186:85,0:91,0:25,161,0,0 0/1:350,3:0.012:353:151,1:179,2:55,295,0,3
-linked-de-bruijn-graph
HLA-B*07_02_01_01 932 . A G . . DP=553;ECNT=1;MBQ=37,37;MFRL=480,492;MMQ=60,60;MPOS=33;NALOD=2.26;NLOD=54.00;POPAF=6.00;TLOD=3.40 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:186,1:5.468e-03:187:85,0:91,0:25,161,0,1 0/1:349,3:0.012:352:151,1:179,2:55,294,0,3
So this is interesting - with the linked graph turned on it seems that the caller is now at least seeing one on the G alleles in the normal. Filtering, however, still classifies the call as "PASS", and the overall allelic depths still seem off.
I'd be intrigued to learn more about what these parameters do - https://gatk.broadinstitute.org/hc/en-us/articles/360036730411-Mutect2 does not mention them, but refers to a list of "inherited arguments", which I struggle to find...
Any other options to try? I *may* be able to share the BAM with you if helpful at all, but would have to discuss with clinical collaborators first... so let me know if this would be useful!
-
I will try to think of more arguments to try, but I suspect I will need to end up asking for the bam.
-linked-de-bruijn-graph turns on a new feature as described in this paper: https://academic.oup.com/bioinformatics/article/34/15/2556/4938484. We expect to make it the default for Mutect2 soon. Briefly, it restores phasing and cycle information to the de Bruijn graph so that you get connectivity past the kmer size.
-padding-around-snps controls the size of windows for Pair-HMM alignment of reads to assembled haplotypes.
Please sign in to leave a comment.
11 comments