Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


How does Mutect2 count read depth/coverage?

Answered

11 comments

  • David Benjamin

    Sameer Shankar The DP annotation is the number of reads that overlap a variant after local reassembly, pass filters such as minimum mapping quality, and are considered informative, that is, they support one allele better than another.  In most cases this is the same as the number of overlapping reads.
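
    For illustration, here is a made-up VCF record (the position and numbers are invented, not from any real data); the DP in the INFO and sample columns is this filtered, informative-read count rather than the raw pileup depth:

    #CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO   FORMAT    SAMPLE
    chr1    1000000  .   A    T    .     .       DP=60  GT:AD:DP  0/1:35,25:60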

    You can enable the -bamout option to produce a bam file of Mutect2's local reassembly.  It's possible that the output depth will make sense in light of the bamout, which can be viewed in IGV like ordinary sequencing data.
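
    For example, something along these lines (file names are placeholders, not your actual paths):

    # placeholder file names; -bamout writes the locally reassembled reads
    gatk Mutect2 \
        -R reference.fasta \
        -I tumor.bam \
        -O calls.vcf.gz \
        -bamout assembled_reads.bam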

  • Sameer Shankar

    When I produce a VCF file by running gatk AnnotateVcfWithBamDepth and apply each read filter individually, I seem to get the same sort of table each time: the BAM depth shows up as 403 and the DP shows up as 71 (ideally the two values should be closer). How can I find out what in GATK (for example, a particular read filter) is causing so many reads to be thrown out?

    I have also tried the -bamout option, which produced a BAM file that I have looked at in IGV. I can't find anything that indicates why so many reads have been thrown out. I also have the alternative BAM file in which all 403 reads have been kept (this was obtained via a different pipeline). Is there a way to compare the two and find the root cause of the difference between them?
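
    For example, I imagine something like a read-name comparison along these lines (file names and region are placeholders; both bams would need to be coordinate-sorted and indexed):

    # list read names present in the full bam but absent from the bamout
    samtools view full_pipeline.bam chr1:1000000-1000500 | cut -f1 | sort -u > full_names.txt
    samtools view mutect2_bamout.bam chr1:1000000-1000500 | cut -f1 | sort -u > bamout_names.txt
    comm -23 full_names.txt bamout_names.txt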

    I have also run gatk AnnotateVcfWithBamDepth with every filter disabled, and DP = 71 still shows up for me.
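
    Roughly, the invocation looked like this (the paths are placeholders, and the filter-disabling flag shown here is the engine-level one):

    # placeholder paths; -V is the Mutect2 VCF, -I the bam it was called from
    gatk AnnotateVcfWithBamDepth \
        -R reference.fasta \
        -V mutect2_calls.vcf.gz \
        -I sample.bam \
        -O calls.bam_depth.vcf.gz \
        --disable-tool-default-read-filters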

  • David Benjamin

    I wouldn't put much stock in AnnotateVcfWithBamDepth.  It's a bare-bones tool used only for internal validations.

    The most common reason for so many reads to be thrown out is that they are marked as duplicates or are secondary or supplementary alignments.  Low mapping quality is also possible if it's a difficult part of the genome.

    The GATK output (that is, the plain-text logging printed to the terminal) says how many reads were rejected by each read filter.  It ought to be able to tell you how 403 reads are whittled down to 71.  Keep in mind that once one read filter fails, the others are not tested, so the counts only reflect the first filter to reject each read.

  • Sameer Shankar

    I saw that the plain-text logging output states that ~30,000 reads are thrown out by the read-length read filter, but in my case over 1,000,000 reads are getting thrown out. No other filter throws out any significant number of reads, which seems to suggest that something other than the filters is causing so many reads to be discarded.

    I have also checked mapping quality in IGV, and that is definitely not the problem (all of the reads have good mapping quality). I have likewise checked for reads marked as duplicates and for secondary or supplementary alignments (I ran samtools flagstat on the region):

    momac39:Downloads sshankar$ samtools flagstat CC-HAM-0383.qiaseq.reheadered.RG.fixmate.sorted.recal_spec_region.bam
    405 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    405 + 0 mapped (100.00% : N/A)
    405 + 0 paired in sequencing
    222 + 0 read1
    183 + 0 read2
    405 + 0 properly paired (100.00% : N/A)
    405 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
  • David Benjamin

    You have ruled out a lot of possible explanations.  Is this amplicon sequencing, or some other protocol where a lot of reads start at the same position?

    If it's not that, could you post one bamout screenshot showing realigned reads and one screenshot showing the locally-assembled haplotypes (these should have the 'HP' read group)?

  • Sameer Shankar

    This is targeted panel sequencing (around 250-300 regions). Below are the two BAM files.

    This is the BAM file after running Mutect2 (screenshot).

    This is the BAM file before running Mutect2, after completing all necessary data pre-processing (screenshot).

    As you can see, the first file has a total read count of 77, whereas the second has a total read count of 403. I have compared the two BAM files side by side, but I am not able to see anything that explains the difference in the number of reads.

  • David Benjamin

    It looks like a lot of reads start at exactly the same position in the original bam.  GATK's default downsampler keeps at most 50 reads starting at each position.  When start positions are random this is okay, but in your case it is not.  The --downsampling-stride argument fixes this: for example, --downsampling-stride 20 downsamples to 50 × 20 = 1,000 reads starting anywhere in a 20-base window.  There's no harm in setting the stride as big as 100, and at this depth you could also turn off downsampling entirely with no ill effect.
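
    As a sketch (placeholder file names):

    # placeholder file names; stride 20 pools read starts over 20-base windows
    gatk Mutect2 \
        -R reference.fasta \
        -I tumor.bam \
        -O calls.vcf.gz \
        --downsampling-stride 20 \
        --max-reads-per-alignment-start 50

    Here --max-reads-per-alignment-start is the per-start cap of 50 mentioned above (50 is its default; setting it to 0 should turn downsampling off entirely).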

  • Sameer Shankar

    Hi, I got an error when trying to run the gatk Mutect2 command using:

    -stride 20
    or
    --downsampling-stride 20

    I was getting something similar when trying -dt NONE to disable the downsampling Mutect2 does, but I got the same error about NONE not being an argument that can be passed. Am I making a mistake? I read the Mutect2 documentation and this seems correct...

  • David Benjamin

    The -bamout argument needs to be followed by a path to the bamout file.  The argument parser is seeing "-stride" as the value of the -bamout argument, and then sees "20" as a standalone positional argument.  That error message could definitely be made more helpful...
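
    Something of this shape should parse (placeholder paths):

    # placeholder paths; note -bamout is immediately followed by its file path
    gatk Mutect2 \
        -R reference.fasta \
        -I tumor.bam \
        -O calls.vcf.gz \
        -bamout mutect2_assembled.bam \
        --downsampling-stride 20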

  • Sameer Shankar

    Thank you very much! It seems to have worked, and now my read depth is up to just over 390. I was wondering: was something about the --downsampling-stride argument and reads starting at the same position mentioned in the GATK paper or supplementary materials that I missed?

  • David Benjamin

    I'm glad that worked. You didn't miss anything.  We have never compiled a set of best practices for amplicon data.

