Active Regions Mutect2 Theory
Hi,
We're currently trying to troubleshoot why GATK Mutect2 failed spectacularly with calling somatic variants.
Im trying to call variants on an enrichment-style libraries. Essentially regions have different levels of coverage, and like 2% of the genome have really high areas of coverage (500+ read depth) and when looking at these high coverage regions mutect2 misses the variants that we've confirmed are there both in the lab and via samtools mpileup.
Due to the design of the library of areas of high enrichment surrounded by areas of low enrichment, we were wondering if the active regions calculation could be why this is happening.
Several questions stemming from the biorxiv paper:
1. "It computes an “active probability” for each locus" - What is the region size of a locus? is it a specific kmer length?
2. What is the function to calculate active probability? Is it more than mismatch?
3. "An ActiveRegion is thus defined as an interval of contiguous loci where the active probability exceeds a certain threshold (0.002 by default.)" does every base in the region have to exceed the threshold or is it the average across the area?
-
Mutect2 uses entrophy based active region detection which you can read more about its mathematical details in the document below
https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
For those missing mutations in mutect2 there are bunch of parameters that you may modify to change the sensitivity of calling.
--initial-tumor-lod 0.0
--tumor-lod-to-emit 0.0These 2 parameters are used in our mitochondria-mode for sensitive calling for ultra rare variants which can also be used for high depth tumor data for increased sensitivity.
We also suggest users to disable downsampling by setting
--max-reads-per-alignment-start 0
to ensure all reads are considered not a portion of them for very high coverage regions.
We also have additional recommendations in the document below for missing variants.
Finally we can also recommend you to try our pileup detection algorithm which can utilize both the aligned reads data along with the reassembled haplotypes for a better hmm formation
--pileup-detection <Boolean> If enabled, the variant caller will create pileup-based haplotypes in addition to the
assembly-based haplotype generation. Default value: false. Possible values: {true, false}
--use-pdhmm <Boolean> Partially Determined HMM, an alternative to the regular assembly haplotypes where we
instead construct artificial haplotypes out of the union of the assembly and pileup
alleles. Default value: false. Possible values: {true, false}Make sure you use both of these parameters to enable pileup detection mode to its fullest. PDHMM code is currently not accelerated and may be slow but if intervals are run in parallel scatters you may gain some speed to make up for the missing performance. PDHMM enables Mutect2 to integrate all variants detected in the aligned reads into the reassembly graph to ensure generation of better graphs for variant detection.
I hope these would help solve your problem.
Regards.
-
Hi Gökalp Çelik ,
Thank you. For the Mutect2 paper, section B, I have a few questions:
1. The equation below seems to be the only instance of bernouli entropy. My understanding is that this equation calculates the log odds of one of the two cases: that an alt allele exists or that only ref alleles exist. That would assume this is applied per base not per region. Am I mistaken?
2. If my assumption in 1 is true, how is active region detection calculated?
3. Could variability in sequencing depth even at high levels (ie peaks that vary from 400-1000+ reads) affect active region calculation? -
Hi again.
Yes you are absolutely right that lod is calculated per base not per region. Mutect2/HaplotypeCaller walks over the genome with a given window size (which can be extended based on assembly region extension parameter) and a single active base is enough to set the region as active. Once a region has sites greater than the
--initial-tumor-lod
default then Mutect2 starts detecting variants within the active region. Setting this variable 0 practically makes it quite permissive for Mutect2 to become more sensitive. Once reassembly and pairHMM is done sites with lod larger than the default value of
--tumor-lod-to-emit
are emitted as variation in the VCF along with many other parameters calculated based on PoN, germline resource, matched normal, read orientation, sequencing artifact detection and PCR artifact detection, contamination, segmentation etc..
Too high depths will reduce the initial tumor lod or the calculated tumor lod for ultra rare sites below the default threshold therefore it will limit the detection sensitivity for Mutect2. However if ultra rare allele fractions are a concern and needs detection we recommend setting the lod to 0.0. Beware that this also will bring more false positives along with rare variants.
I hope this helps.
Please sign in to leave a comment.
3 comments