This tutorial is applicable to Mutect2 version 4.1.1.0 and higher. Post suggestions/questions in the GATK Community Forum
GATK 4.1.1.0 contains several major advances in the Mutect2 pipeline, which is good, but we have had to change command lines in a few places, which we usually try to avoid. Here are the major changes.
Bug Fixes
Fixes were made regarding several bugs (with error messages about invalid log probabilities, infinities, NaNs, etc.) that were introduced by neglecting to account for finite precision errors. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.
New Required Inputs to FilterMutectCalls
1) FilterMutectCalls now requires a reference, which should be the same fasta file input to Mutect2.
2) FilterMutectCalls also requires a stats file from Mutect2. When you run Mutect2 with output -O unfiltered.vcf, for example, it produces a file called unfiltered.vcf.stats. FilterMutectCalls automatically looks for this file, so if you are running locally no change is needed. That is,
gatk Mutect2 -R ref.fasta -I tumor.bam -O unfiltered.vcf
followed by
gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -O filtered.vcf
works because Mutect2 creates unfiltered.vcf.stats behind the scenes and FilterMutectCalls knows to look for it. However, if you are running on a cluster or the cloud you need to keep track of the stats file. For example, you need to delocalize it from a VM, as is done in the Mutect2 WDL. You can explicitly input the stats file with the -stats argument in FilterMutectCalls. If you are scattering Mutect2 over multiple nodes you must merge the stats files with MergeMutectStats:
gatk MergeMutectStats \ -stats unfiltered1.vcf.stats \ -stats unfiltered2.vcf.stats \ -O merged.stats
and pass merged.stats to FilterMutectCalls.
New Filtering Strategy in FilterMutectCalls
FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation. Previously, different reasons why this might not be the case each had their own thresholds. We have removed parameters such as -normal-artifact-lod
, -max-germline-posterior
, -max-strand-artifact-probability
, and -max-contamination-probability
. Even the previously fundamental -tumor-lod
is gone. Rather than replace these with a single threshold for the error probability, FilterMutectCalls replaces them with nothing at all. It automatically determines the threshold that optimizes the "F score", the harmonic mean of sensitivity and precision. In order to tweak results in favor of more sensitivity users may set -f-score-beta
to a value greater than its default of 1 (beta is the relative weight of sensitivity versus precision in the harmonic mean). Setting it lower biases results toward greater precision.
You can think of these changes as doing two things. Unifying filtering thresholds puts all the filters at the same place on a ROC curve of sensitivity vs precision. Previously, one threshold might be sacrificing a lot of sensitivity for a small gain in precision while another might be doing the opposite, the result being poor sensitivity and precision that fell below the potential ROC curve. Once we’re on that ROC curve, we achieve a good balance between sensitivity and precision by optimizing the F score.
New Somatic Clustering Model in FilterMutectCalls
We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor were a het occurring in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2 this means that the likelihood for somatic variants is given by a binomial distribution. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. This is still an oversimplification because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don’t match a few discrete values. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.
In addition to clustering allele fractions we also learn the overall prior probabilities of somatic SNVs and indels so that we can be more skeptical of apparent variants in a quiet tumor, for example. We learn the parameters of this model with a stochastic EM approach, where the E steps consist of Chinese Restaurant Process sampling of the allele fraction clusters. In case you were wondering, we have tested this new approach on some of the off-label, non-cancer uses of Mutect2, such as mitochondria, and it works very well.
FilterMutectCalls reports the learned parameters of somatic clustering in a new .filtering_stats.tsv output. This file also contains information on the probability threshold chosen to optimize the F score and the number of false positives and false negatives from each filter to be expected from this choice.
A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow
If you suspect any of your samples of substitution errors that occur on a single strand before sequencing you should definitely use Mutect2's orientation bias filter. This applies to all FFPE tumor samples and samples sequenced on Illumina Novaseq machines, among others. In fact, with the optimizations in 4.1.1.0 you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small.
There are three steps to the filter. First, run Mutect2 with the --f1r2-tar-gz
argument. This creates an output with raw data used to learn the orientation bias model. Previously this was done by CollectF1R2Counts. By absorbing it into Mutect2, we eliminated the cost of CollectF1R2Counts with almost no effect on the runtime of Mutect2. When multiple tumor samples are specified, you only need a single --f1r2-tar-gz
output, which contains data for each tumor sample.
gatk Mutect2 -R ref.fasta \ -L intervals.interval_list \ -I tumor.bam \ -germline-resource af-only-gnomad.vcf \ -pon panel_of_normals.vcf \ --f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf
Next, pass this raw data to LearnReadOrientationModel:
gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz
Run GetPileupSummaries to summarize read support for a set number of known variant sites.
gatk GetPileupSummaries \ -I tumor.bam \ -V chr17_small_exac_common_3_grch38.vcf.gz \ -L chr17_small_exac_common_3_grch38.vcf.gz \ -O getpileupsummaries.table
Estimate contamination with CalculateContamination.
gatk CalculateContamination \ -I getpileupsummaries.table \ -tumor-segmentation segments.table \ -O calculatecontamination.table
Finally, pass the learned read orientation model to FilterMutectCallswith the -ob-priors argument:
gatk FilterMutectCalls -V unfiltered.vcf \ [--tumor-segmentation segments.table] \ [--contamination-table contamination.table] \ --ob-priors read-orientation-model.tar.gz \ -O filtered.vcf
Advanced note: if you are scattering Mutect2 over nodes in a cluster or on the cloud, you must input the --f1r2-tar-gz output from each Mutect2 scatter to LearnReadOrientationModel. This is done automatically in the Mutect2 wdl in the gatk repo and on Terra. For example:
for chromosome in {1..22}; do gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf done all_f1r2_input=`for chromosome in {1..22}; do printf -- "-I ${chromosome}-f1r2.tar.gz "; done` LearnReadOrientationModel $all_f1_r2_input -O read-orientation-model.tar.gz
A step-by-step guide to the new Mutect2 Panel of Normals Workflow
We rewrote CreateSomaticPanelOfNormals to use GenomicsDB for scalability. We also added the INFO fields FRACTION (the fraction of normal samples with an artifact) and BETA (the shape parameters of the beta distribution of artifact allele fractions among samples with the artifact) to the panel of normals. FilterMutectCalls doesn’t use these yet, but we hope to experiment with them in the near future. Furthermore, we never liked how germline variants, which are handled in a more principled way with our germline filter, ended up as hard-filtered pon sites, so the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.
The three steps to create a panel of normals are:
Step 1: Run Mutect2 in tumor-only mode for each normal sample:
gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz gatk Mutect2 -R reference.fasta -I normal2.bam --max-mnp-distance 0 -O normal2.vcf.gz
... etc.
Step 2: Create a GenomicsDB from the normal Mutect2 calls:
gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \ --genomicsdb-workspace-path pon_db \ -V normal1.vcf.gz \ -V normal2.vcf.gz \ -V normal3.vcf.gz
Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:
gatk CreateSomaticPanelOfNormals -R reference.fasta \ --germline-resource af-only-gnomad.vcf.gz \ -V gendb://pon_db \ -O pon.vcf.gz
In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).
19 comments
Can I please confirm that this is the logic being used in the current PoN creation step:
====
Any candidate PoN site must occur in at least 2 samples (as per the default for --min-sample-count).
If the site is not in gnomad (or at negligible frequency), it's automatically treated as a candidate site for PoN inclusion.
If a site IS in gnomad at a non-negligible frequency, then it computes the probability of it being a germline variant. If the p(germline) < 0.5, then it's a candidate. (this 0.5 is manually tunable via cmd line parameter: --max-germline-probability
====
and if this is true, can we update the code documentation at:
https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/CreateSomaticPanelOfNormals.java
to reflect that germline variants are not intended to be captured now during the PoN creation step?
Also, can you please confirm that the panel of normals is excluding variants based on chromosome position and not requiring allele-specific matching? ie. a G->A variant in the tumor sample may be filtered if there's a G->T entry at that chromosomal position in the panel of normals. I've seen examples of this and just want to confirm that this is the expected behavior. Thx in advance.
Hi, I just ran the GenomicsDBImport to my exome samples (17 Germline samples). it has run 3 days and create some tmp file occupying 1.5 Tb and its still on chr4! Is there any solution
Hi, when I run GetPileupSummaries , I get this error:
How to solve this problem
Hi Nickier, I had the same problem and figured out it's basically related to how contigs are passed from the BED file. If the BED (or interval_list as they call it) has a lot of different little intervals on every contig/chr you are analyzing, you'll see in the first lines of GenomicsDBImport output a warning about that. The solution is to insert --merge-input-interval in the GenomicsDBImport command
Hello, I have a question regarding the `-tumor-segmentation` option in CalculateContamination:
- gatk website states it "output table containing segmentation of the tumor by minor allele fraction"
- does this mean modeling possible mixture of 2 samples?
- this wasn't done in the tutorial for the last version (https://gatk.broadinstitute.org/hc/en-us/articles/360035889791?id=11136), where below is the command:
- is `-tumor-segmentation` a recommended option now? how does it help in calculating contamination?
another related question, what does the [ ] (square brackets) mean here? are they optional?
- if I run the above FilterMutectCalls, does it take care of both *cross-sample contamination* filtering and *orientation bias* filtering?
Hi, Enrico Cocchi , you are right. It makes effect when I add the argument --merge-input-interval . Thanks a lot.
when will gatk support calling complex variants, for example variants in EGFR,I do not know after so many versions of updating of gatk, this requirement is still not satisfied
Panel-of-Normals file generation, GenomicsDBImport file merge, I got the following error message several times:
[E::bcf_get_info_values] Trying to get 32-bit int data from a field which contains 64 bit values
Do you have any clue why?
Hi, can you please explain what's exactly the role of -tumor-segmentation segments.table in CalculateContamination and how we're supposed to generate such table?
Hi, I got this error when I add more than 2 vcf files for GenomicsDBImport.
This is my code and I am not sure why it is recognized as duplicate files.
I added my vcf files this way:
-V HDF1_TA1_S4_normal.vcf.gz \
-V HDF1_TA2_S5_normal.vcf.gz \
-V HDF1_TA3_S6_normal.vcf.gz \
-V HDF1_UT1_S1_normal.vcf.gz
And, this is the error I got:
A USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file:///gpfs/research/medicine/sequencer/NovaSeq/Outputs_fastq/2020_Outputs/Akash_Gunjan_05-19-2020_Yuna-samples/GATK_Bamfiles/Preprocessed_Bam/PON/HDF1_TA2_S5_normal.vcf.gz and HDF1_TA1_S4_normal.vcf.gz
The FilterMutectCalls requires a reference argument. But this is omitted in the above commands. Is it deliberate or it can be added.
I could not understand this new pipeline of Mutect2 (4.1.1.0, or newer).
My samples are tumor with matched normal tissue, WGS. Ref is hg38. Tissue type is FFPE.
Could anyone tell me how to call somatic event?
My understanding is like this (let's say i have 3 paired samples) Looks wired:
I use this I use this command to generate pon.vcf.gz,
"gatk CreateSomaticPanelOfNormals -R /home/dlin/reference/bwa-ref-hg19/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --germline-resource af-only-gnomad.vcf.gz -V gendb://pon_db -O pon.vcf.gz"
but i can not find "af-only-gnomad.vcf.gz" file for hg19, can anybody tell me how to dowload this file?
I have a problem with PoN generation Step 2: GenomicsDBImport step creates EMPTY databases from Mutect2 called VCFs: here the link to the Biostar post I created about this. Please comment if you have any idea about this!
Hello, I use the 'Scatter Gather' mode, and run mutect2 in parallel on separate chromosomes, but there is a consistency problem between the results obtained by using the 'Scatter Gather' mode and the results obtained by running without this mode. Excuse me, how to ensure consistent use of the 'Scatter Gather' mode.
https://github.com/broadinstitute/gatk/issues/8152
Hi GATK Team,
I am working on somatic variant calling for my WES data. We have created our own panel of normals and I followed all the above listed steps to create panels of normals , Mutect2 calling and Filtermutectcall.
In filtering stage, I did two types of filtering: 1: one with basic filtering where Filtermutectcall was run with our orientation model, contamination table, 2: and second filtering with contamination table, orientation model and I called this Advanced Filtering etc.
I have few questions:
1: Why mutect2 is calling somatic variants which as VAF close to 1? I generated a distribution of VAF for my samples and I saw two spikes: 1 spike between 0 to 0.25 VAF and another spike between 0.9-1VAF. Why mutect2 is calling somatic variants at 0.9-1 VAF range? Here is the example distribution graph.
2: With Advanced filtering (Filtermutectcall with contamination table, orientation model etc), I found few samples have a spike at 0.5 VAF. Why mutetc2 calling somatic variants with 0.5VAF and all those variants are on CHR6 near HLA region. I used GRCH38 ref and it has alt coatings in it.
I will really appreciate your response.
Thanks
Zeeshan
There is a typo in above comment: "In filtering stage, I did two types of filtering: 1: one with basic filtering where Filtermutectcall was run with our orientation model, contamination table"
correction:
In filtering stage, I did two types of filtering: 1: one with basic filtering where Filtermutectcall was run without orientation model, contamination table"
I have a question about how to remove germline mutations from somatic vcf files caused by Mutect2 tool, is it based on AF information? How to choose the threshold?The command I used is "gatk Mutect2 -R Homo_sapiens_assembly38.fasta -I xx.bam --germline-resource somatic_hg38_af-only-gnomad.hg38.vcf.gz --panel-of-normals somatic_hg38_1000g_pon.hg38.vcf.gz -O xx.vcf". These somatic_hg38_af-only-gnomad.hg38.vcf.gz and somatic_hg38_1000g_pon.hg38.vcf.gz I downloaded from the google cloud provided by gatk. Has Mutect2 considered the AF frequency to annotate when performing the inclusion of germline parameters? Will there be any overlap between this annotation for germline mutations and actual somatic mutations. Can I assume that this is a germline mutation if it is labeled germline after mutect2, and can I just filter out the loci labeled as germline mutations and consider the remaining loci as possible somatic mutations?
Please sign in to leave a comment.