How does Mutect2 count read depth/coverage?
GATK version used: GATK 4.2.0.0
More of a generic question: how does Mutect2 count read depth (coverage)? The VCF output from Mutect2 shows a read depth of 71, where we expect to see significantly more reads. I wanted to understand why this could be happening (no parameters were changed and no filters were added).
-
Sameer Shankar: The DP annotation is the number of reads passing filters such as minimum mapping quality that overlap a variant after local reassembly and are considered informative, that is, reads that support one allele better than another. In most cases this is the same as the number of overlapping reads.
You can enable the -bamout option to produce a bam file of Mutect2's local reassembly. It's possible that the output depth will make sense in light of the bamout, which can be viewed in IGV like ordinary sequencing data.
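For example, a minimal sketch of such a run (the file names here are placeholders, not your actual inputs):

gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    -O variants.vcf.gz \
    -bamout assembled_reads.bam    # BAM of the local reassembly, viewable in IGV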
-
When I produce a VCF file by running gatk AnnotateVcfWithBamDepth and apply each read filter individually, I get the same sort of table each time: the BAM depth shows up as 403 and the DP shows up as 71 (ideally the two values should be closer). How can I find out what in GATK (for example, a particular read filter) is causing so many reads to be thrown out?
I have also tried the -bamout option which has produced a BAM file that I have looked at in IGV. I can't seem to find anything that indicates why so many reads have been thrown out. I do also have the alternative BAM file where all 403 reads have been kept (this was obtained via a different pipeline). I was wondering if there was a way to compare them and find out the root cause of the difference between them?
I have also used gatk AnnotateVcfWithBamDepth, disabled every filter, and DP = 71 still shows up for me.
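For reference, a rough sketch of the kind of command I mean, with placeholder file names (--disable-tool-default-read-filters turns off the tool's default read filters):

gatk AnnotateVcfWithBamDepth \
    -V mutect2_calls.vcf.gz \
    -I tumor.bam \
    -O bam_depth_annotated.vcf.gz \
    --disable-tool-default-read-filters
-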
I wouldn't put much stock in AnnotateVcfWithBamDepth. It's a bare-bones tool used only for internal validations.
The most common reason for so many reads to be thrown out is that they are marked as duplicates or are secondary or supplementary alignments. Mapping quality is also a possibility if it's a difficult part of the genome.
The GATK output (that is, the plain text logging output to the terminal) says how many reads were rejected by which read filters. It ought to be able to tell you how 403 reads are whittled down to 71. Keep in mind that once one read filter fails, the others are not tested, so the counts only reflect the first filter to reject each read.
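If you want to test a particular filter directly, you can disable it by name and see whether the depth recovers. A sketch, assuming the read length filter is the suspect (ReadLengthReadFilter is one of Mutect2's default read filters; file names are placeholders):

gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    -O test.vcf.gz \
    --disable-read-filter ReadLengthReadFilter    # rerun with one default filter switched off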
-
I saw that the plain text logging output states that ~30,000 reads are thrown out by the read length read filter, but in my case ~1,000,000+ reads are getting thrown out overall. No other filter is throwing out a significant number of reads, which suggests that it is something other than the filters that is causing so many reads to be discarded.
I have also checked mapping quality in IGV, and that is definitely not the issue (all of the reads have good mapping quality). I have also checked for reads marked as duplicates and for secondary or supplementary alignments (I ran samtools flagstat on the region):
momac39:Downloads sshankar$ samtools flagstat CC-HAM-0383.qiaseq.reheadered.RG.fixmate.sorted.recal_spec_region.bam
405 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
405 + 0 mapped (100.00% : N/A)
405 + 0 paired in sequencing
222 + 0 read1
183 + 0 read2
405 + 0 properly paired (100.00% : N/A)
405 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
-
You have ruled out a lot of possible explanations. Is this amplicon sequencing, or some other protocol where a lot of reads start at the same position?
If it's not that, could you post one bamout screenshot showing realigned reads and one screenshot showing the locally-assembled haplotypes (these should have the 'HP' read group)?
-
This is targeted panel sequencing (around 250-300 regions). Below are the two BAM files, shown as screenshots:
This is the BAM file after running Mutect2
This is the BAM file before running Mutect2 (after completing all the necessary data pre-processing)
As you can see, the first file has a total read count of 77, whereas the second file has a total read count of 403. I have compared the two BAM files side-by-side, but I am not able to see anything that can explain the difference in the number of reads.
-
It looks like a lot of reads start at exactly the same position in the original BAM. GATK's default downsampler keeps at most 50 reads starting at each position. When start positions are random this is fine, but in your case it is not. The --downsampling-stride argument fixes this: for example, --downsampling-stride 20 downsamples to 50*20 reads starting anywhere in a 20-base window. There's no harm in setting the stride as big as 100, and at this depth you could also turn off downsampling entirely with no ill effect.
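Concretely, either of these should work; this is a sketch with placeholder file names, and --max-reads-per-alignment-start is the argument that controls the 50-reads-per-start cap (0 disables it):

# pool start positions over a 20-base window
gatk Mutect2 -R reference.fasta -I tumor.bam -O variants.vcf.gz \
    --downsampling-stride 20

# or turn off downsampling entirely
gatk Mutect2 -R reference.fasta -I tumor.bam -O variants.vcf.gz \
    --max-reads-per-alignment-start 0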
-
Hi, I got the following error when trying to run the gatk Mutect2 command using:
-stride 20
or
--downsampling-stride 20

I was getting something similar when trying -dt NONE to remove the downsampling Mutect2 does, but I got the same error about NONE not being an argument that can be passed. Am I making a mistake? I read the Mutect2 documentation and this seems correct...
-
The -bamout argument needs to be followed by a path to the bamout file. The argument parser is seeing "-stride" as the value of the -bamout argument, and then sees "20" as a standalone positional argument. That error message could definitely be made more helpful...
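In other words, something like this should parse (placeholder file names again):

gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    -O variants.vcf.gz \
    -bamout assembled_reads.bam \
    --downsampling-stride 20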
-
Thank you very much! It seems to have worked and now my read depth is up to just over 390. I was wondering, was something about the -downsampling-stride argument and reads starting at the same position mentioned in the GATK paper or supplementary materials that I missed?
-
I'm glad that worked. You didn't miss anything. We have never compiled a set of best practices for amplicon data.