Here is a collection of questions related to Mutect2 that we frequently find asked on our GATK forum.
For more info on the Mutect2 tool, visit the Mutect2 tool index. For more info on the Mutect2 Best Practices, visit our Best Practices documentation.
1.
For GenomicsDBImport, what options do I include in -L (or --intervals) for whole genome contiguous datasets?
The GenomicsDBImport -L (or --intervals) option takes one or more genomic intervals over which to operate. For example, if you wanted to use a 20 bp padding interval around the locus at chr1:100, you could use -L 1:80-120. This is typically used to add padding around targets when analyzing exomes. If you have a lot of intervals you want to pad, a less cumbersome way to do this would be -L 1:100 -ip 20 (“ip” stands for “interval padding” and is an argument of all GATK tools).
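As a small illustration of what interval padding does to the coordinates, here is a minimal Python sketch. The helper name pad_interval is hypothetical and is not part of GATK; it only reproduces the arithmetic behind the -L 1:80-120 example above.

```python
def pad_interval(contig: str, start: int, end: int, padding: int) -> str:
    """Return an interval string padded by `padding` bases on both sides.

    Coordinates are treated as 1-based and inclusive, and the padded start
    is clamped so that it never drops below position 1.
    """
    padded_start = max(1, start - padding)
    padded_end = end + padding
    return f"{contig}:{padded_start}-{padded_end}"


# The single locus 1:100 with 20 bp of padding on each side.
padded = pad_interval("1", 100, 100, 20)
print(padded)  # 1:80-120
```

In practice -ip does this for you across every interval, so a helper like this is only useful for checking what a padded interval list would look like.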
More info is available at the GenomicsDBImport Tool Index.
2.
How does Mutect2 determine whether a given call is somatic or germline, if it's not in the included Resource file?
The germline resource is used to get the frequency of a variant allele in the population, thereby providing the prior probability that the sample carries the allele in the germline. This prior is one ingredient in a statistical model for germline variation. When an allele is missing from the germline resource, Mutect2 uses the same model with a very small imputed allele frequency.
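To give a feel for how a population allele frequency turns into a prior, here is a minimal Python sketch. It assumes a diploid sample and Hardy-Weinberg proportions, which is a simplification and not necessarily the exact formula Mutect2 uses; the imputed frequency below is a hypothetical placeholder, not a documented default.

```python
def germline_carrier_prior(allele_frequency: float) -> float:
    """Prior probability that a diploid sample carries at least one copy.

    Assumes Hardy-Weinberg proportions: P(carrier) = 1 - (1 - f)^2.
    This is an illustrative simplification, not Mutect2's internal model.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return 1.0 - (1.0 - allele_frequency) ** 2


# Hypothetical imputed frequency for an allele absent from the resource.
IMPUTED_AF = 1e-6

common = germline_carrier_prior(0.05)         # allele listed in the resource
missing = germline_carrier_prior(IMPUTED_AF)  # allele missing from the resource
print(common, missing)
```

The point is only that an allele missing from the resource still receives a small, non-zero germline prior, so the same model applies to it.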
More information about this is described in Section II E of the Mutect2 documentation.
3.
Does the size of an INDEL variant affect the VAF field?
No, the size of an INDEL variant should not have any effect on the VAF field.
4.
In Mutect2 (Tumor-only mode), is it possible to distinguish between likely germline and somatic mutations, while keeping both?
The tool does its best, but this is essentially impossible. Any genome will have several tens of thousands of germline variants so rare that they don’t appear in gnomAD, and there is hardly any way to distinguish these from somatic variants. The exception is if, due to low purity or other reasons, most somatic mutations in a sample have low allele fractions. In such cases, FilterMutectCalls can distinguish them from germline hets.
5.
How is the Mutect2 filter different in tumor-only mode, versus in matched-normal mode? What does it do differently in each case?
In tumor-normal mode, Mutect2 detects germline variants using (1) the population allele frequency from the germline resource as a prior, (2) the normal reads, and (3) the allele fraction in the tumor (allele fractions near ½ are suggestive of germline hets). In tumor-only mode, the evidence from normal reads is missing, but it’s the same model. Additionally, in tumor-only mode, the powerful normal artifact filter is not available.
6.
Can I redefine what constitutes a Tumor/Normal sample while variant calling, for the purposes of analysis? For example, using a primary tumor sample as "normal" and a relapsed tumor as "tumor"?
Yes, this is possible.
7.
Can I change my TLOD (Tumor LOD) or NLOD (Normal LOD) filtering thresholds?
FilterMutectCalls has not filtered with thresholds on TLOD and NLOD for quite a long time. See Sections II C and II E of the Mutect2 documentation for details. There is currently a single parameter for adjusting sensitivity vs. specificity, the -f-score-beta argument. See Section II A of the documentation for details.
8.
How does Mutect2 multi-sample mode work?
In Mutect2's multi-sample mode, normal reads are pooled in memory. Even though the input BAMs are not merged, Mutect2's variant calling will treat them as if they were. (The only sign that they came from different BAMs is that they will have distinct genotype fields in the output VCF.) Tumor BAMs and tumor reads are not merged, but Mutect2 uses all reads at once in its local assembly. Mutect2 also genotypes all tumor samples jointly, which means that they share statistical power. For that reason, multi-sample mode is especially useful when there are several samples with low coverage or low variant allele fractions.
The output is a single VCF with one genotype field for each sample.
Both Mutect2 and FilterMutectCalls only make a single call for each variant (i.e., a variant reported as PASS by FilterMutectCalls means that the evidence of all samples taken together suggests that it is a real somatic variant). A PASS call means that the variant is real and present in at least one tumor sample.
9.
How can I increase the sensitivity of my somatic variant calling?
In FilterMutectCalls you can increase the -f-score-beta parameter from its default of 1 to increase sensitivity at the expense of precision.
More information is available in the Mutect2 Tool Index.
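To see why a larger beta favors sensitivity, here is a minimal Python sketch of the standard F-beta score. The precision and recall figures are made-up numbers for two hypothetical filtering thresholds; how FilterMutectCalls applies the score internally is described in the Mutect2 documentation and is not reproduced here.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: recall is weighted beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Two hypothetical thresholds: (precision, recall). Made-up values.
strict = (0.98, 0.80)   # few false positives, lower sensitivity
lenient = (0.80, 0.92)  # more false positives, higher sensitivity

for beta in (1.0, 2.0):
    print(beta, round(f_beta(*strict, beta), 4), round(f_beta(*lenient, beta), 4))
```

With beta = 1 the strict threshold scores higher, while with beta = 2 the lenient threshold wins, which is the trade-off described above.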
10.
How much RAM should I be allocating to run Mutect2? How many CPUs should I use?
You are discouraged from running Mutect2 using multithreaded options, and for most use-cases only 1 CPU should be enough. About 4GB of RAM should be sufficient for most simple exome sequencing samples.
To increase the speed of a Mutect2 workflow, consider running multiple instances of Mutect2 with the -L (or --intervals) option, each with its own BED file of genomic regions. A target list of 10,000 regions could then be broken up into groups of 100 regions and run in parallel. In the Mutect2 WDL this is achieved by setting the scatter_count parameter. We strongly recommend using the WDL / Terra for this because after scattering many files must be merged before filtering and the WDL handles this automatically.
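For readers splitting intervals by hand, here is a minimal Python sketch of breaking a target list into fixed-size groups. The region strings are synthetic placeholders and the chunk helper is hypothetical; it is not how the WDL implements scatter_count.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], size: int) -> List[List[T]]:
    """Split `items` into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


# Synthetic stand-in for a target list of 10,000 regions.
regions = [f"chr1:{i * 1000 + 1}-{i * 1000 + 500}" for i in range(10_000)]

groups = chunk(regions, 100)
print(len(groups))  # 100
```

Each group would then be written to its own interval file and passed to a separate Mutect2 instance with -L. The per-group outputs still need to be merged before filtering, which is the part the WDL handles automatically.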
11.
Is there a difference between the ways that Mutect2 and GATK4 identify active regions?
Yes. Mutect2 uses a quick approximation of the somatic likelihoods model for evaluating whether a pileup exhibits somatic variation, while HaplotypeCaller uses a germline genotyping model.
12.
Why does Mutect2 only annotate mutations? Why doesn't it actually filter out mutations that are likely false positives (i.e. not marked as PASS)?
The FILTER column as used within Mutect2 is part of the VCF spec. As such GATK tools (and many other software packages) know that anything with a value other than PASS is not a true variant. As far as the spec is concerned, adding to the FILTER column is filtering. A downstream tool that does not recognize this is faulty.
Funcotator has the --remove-filtered-variants argument to omit non-PASS calls from its output, which may be useful if you want to remove likely false positives.
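If you need to drop non-PASS records yourself, the check is simply on the seventh tab-separated column of each record. Below is a minimal Python sketch on toy VCF lines; the records are invented for illustration, and the helper keep_pass_records is hypothetical rather than a GATK function.

```python
from typing import Iterable, List


def keep_pass_records(vcf_lines: Iterable[str]) -> List[str]:
    """Keep header lines and records whose FILTER column (7th field) is PASS."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header and meta-information lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            kept.append(line)
    return kept


# Toy records, invented for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tgermline\t.",
    "chr2\t300\t.\tC\tA\t.\tweak_evidence;germline\t.",
]

for line in keep_pass_records(toy_vcf):
    print(line)
```

For real files, an established VCF tool is a safer choice than hand-rolled parsing, since it also copes with compression and indexing.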
13.
What should I do when I have a high number of transversions after running Mutect2 and FilterMutectCalls on WES data? Could it be due to OxoG artifacts?
You should use the orientation bias artifact filter. See Section II G of the Mutect2 documentation for the necessary commands — LearnReadOrientationModel and FilterMutectCalls.
In the Mutect2 WDL/Terra workflow it is as simple as setting run_orientation_bias_mixture_model_filter to true.
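Outside the WDL, the steps can be chained by hand. The following Python sketch builds the three commands as argument lists suitable for subprocess.run. The file names and the helper name are hypothetical, and the flags (--f1r2-tar-gz, --ob-priors) should match Section II G, but please check them against the documentation for your GATK version.

```python
from typing import List


def orientation_bias_commands(reference: str, tumor_bam: str, prefix: str) -> List[List[str]]:
    """Build the Mutect2 -> LearnReadOrientationModel -> FilterMutectCalls chain.

    All output file names are derived from `prefix` and are placeholders.
    """
    f1r2 = f"{prefix}.f1r2.tar.gz"
    unfiltered = f"{prefix}.unfiltered.vcf.gz"
    model = f"{prefix}.read-orientation-model.tar.gz"
    filtered = f"{prefix}.filtered.vcf.gz"
    return [
        # 1. Call variants and collect F1R2 counts alongside the VCF.
        ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam,
         "--f1r2-tar-gz", f1r2, "-O", unfiltered],
        # 2. Learn the orientation bias model from the collected counts.
        ["gatk", "LearnReadOrientationModel", "-I", f1r2, "-O", model],
        # 3. Filter, passing the learned model as orientation bias priors.
        ["gatk", "FilterMutectCalls", "-R", reference, "-V", unfiltered,
         "--ob-priors", model, "-O", filtered],
    ]


commands = orientation_bias_commands("ref.fasta", "tumor.bam", "sample1")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) in order, since every step consumes the output of the previous one.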
For more information, check our articles on OxoG artifacts and this article in Nucleic Acids Research summarizing the issue.
14.
What types of variants does Mutect2 call?
Mutect2 only calls short variants, including SNAs and indels. Filtering is then applied to remove artifacts while keeping true variants.
If you want to find structural variants, you can look at this tool documentation.
If you have specific variants you do not want in your file, you can use the -XL option to manually exclude sites.
15.
Is the Mutect2 Panel of Normals (PON) used to filter variants?
Mutect2 marks variants that are found in the PON with the “PON” info field, which FilterMutectCalls then uses for filtering. Additionally, Mutect2 considers variants in the PON as inactive by default (this can be changed with the -genotype-pon-sites argument), so most will be silently pre-filtered without ending up in the output. Some PON sites are output because they may appear in an active region containing a non-PON site.
16.
Why am I seeing unexpected variants in my dataset?
For detailed documentation on why certain variants are called, please consult our documentation on Mutect2 haplotype determination or any of our Algorithms documentation.
17.
How do I create a Panel of Normals (PON) in cohort mode?
Cohort mode is for the CNV panel of normals, which is distinct from the Mutect2 panel of normals. This is not a Mutect2 question.
18.
What do --min-base-quality-score and --f1r2-min-bq mean? How are they different?
The --min-base-quality-score is the minimum base quality for a base to be used in a kmer for assembly. For example, if the kmer size is 3 (obviously unrealistic) and we have a read with bases ATCGATTC, where every base except the G has sufficient quality, we will put the kmers ATC, ATT, and TTC on the graph but we will exclude TCG, CGA, and GAT.
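The kmer example above can be reproduced with a few lines of Python. This is only a sketch of the rule as described here, not the assembler's actual code; the quality values and the threshold of 10 are illustrative.

```python
from typing import List, Sequence


def usable_kmers(bases: str, quals: Sequence[int], k: int, min_base_quality: int) -> List[str]:
    """Return the kmers in which every base meets the minimum quality."""
    kmers = []
    for i in range(len(bases) - k + 1):
        if all(q >= min_base_quality for q in quals[i:i + k]):
            kmers.append(bases[i:i + k])
    return kmers


bases = "ATCGATTC"
quals = [30, 30, 30, 5, 30, 30, 30, 30]  # only the G is low quality

print(usable_kmers(bases, quals, k=3, min_base_quality=10))
# ['ATC', 'ATT', 'TTC']
```

A single low-quality base therefore removes every kmer that overlaps it, which is k kmers when the base is far enough from the read ends.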
The --f1r2-min-bq is the minimum base quality to be used for collecting orientation bias statistics when the --f1r2-tar-gz option is turned on.
19.
What is the difference between the panel files in the Google Cloud Best Practices folder?
There are several different panel files located in the Google Cloud Best Practices folder.
"Mutect2-WGS-panel-b37.vcf" is a whole-genome panel, and "Mutect2-exome-panel.vcf" is an exome panel, both of which are generated from several hundred normals sequenced with standard Broad Genomics Platform protocols. Most errors caught by the PON are mapping artifacts, so these are still useful, despite changes in sequencing technology.
"1000g_pon.hg38.vcf" is an hg38 panel of normals for both exomes and whole genomes generated from 1000 Genomes Project samples.
Finally, "af-only-gnomad.hg38.vcf" is a copy of the gnomAD VCF stripped of all unnecessary INFO fields. It is used for the -germline-resource argument.
20.
Can I find common variants between ctDNA and primary tumour samples by performing HaplotypeCaller JointGenotyping?
No, it is not possible to do this because HaplotypeCaller genotypes diploid variants. Even if your primary tumor is pure, monoclonal, and has no CNVs, the low allele fractions in the ctDNA sample will confuse it.
Mutect2 is a better option for this. You can run it in multi-sample mode by specifying the -I argument once for each sample.
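As a rough sketch of what "once for each sample" looks like, the Python helper below assembles a multi-sample Mutect2 command as an argument list. The helper name and file names are hypothetical. The optional -normal arguments take sample names (not file paths) and are only needed when some of the inputs are matched normals.

```python
from typing import List, Sequence


def mutect2_multisample_command(
    reference: str,
    bams: Sequence[str],
    output_vcf: str,
    normal_sample_names: Sequence[str] = (),
) -> List[str]:
    """Build a Mutect2 command with one -I per input BAM."""
    if not bams:
        raise ValueError("at least one BAM is required")
    cmd = ["gatk", "Mutect2", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]  # one -I per sample
    for name in normal_sample_names:
        cmd += ["-normal", name]  # sample name of each matched normal, if any
    cmd += ["-O", output_vcf]
    return cmd


# Hypothetical file names for a primary tumor and a ctDNA sample.
cmd = mutect2_multisample_command(
    "ref.fasta", ["primary.bam", "ctdna.bam"], "joint.vcf.gz"
)
print(" ".join(cmd))
```

The output VCF then has one genotype field per sample, so shared variants can be read off a single record.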
21.
Can I find specific mutations in ctDNA by using Mutect2 on primary tumour samples as if they were a matched normal?
Yes, this is possible.
If you have more specific questions about any of our tools that are not covered in this FAQ or in our documentation, please direct your questions to our GATK forum.
In the meantime, below you can find links to some of our most popular tutorials and articles related to Mutect2.
3 comments
Hi,
I had a strange occurrence that seems to be a glitch in the system somewhere. A week or so ago I had something crop up with Funcotator which led me to install v1.7 of the source materials. I then ran a set of files through our workflow that we have used for nearly a year and sent the results to a colleague to do her part of the process. She came back saying that a significant number (not all) of the 'Protein_Change' column items were missing. Thinking that maybe it was something amiss with v1.7 I re-funcotated with v1.6 and got the same results. So, I went back to an unaligned .bam and re-ran everything -- exactly the same results. In perusing the output there seemed to be a pattern in that Chr 1 seemed to be totally absent?
This led me to check a bit further: I extracted the missense SNPs with an AS_FilterStatus of 'SITE' from the .maf and tallied the four columns that seemed to be problematic in the .maf: Transcript_Position, cDNA_Change, Codon_Change, Protein_change
Below are tab delimited columns and Nrows = number of SITEs in that particular chromosome.
Chr    Nrows  Transcript_Position  cDNA_Change  Codon_Change  Protein_change
chr1   44     0                    0            0             0
chr2   14     0                    9            0             0
chr3   20     0                    20           0             0
chr4   5      0                    5            0             0
chr5   13     0                    13           0             0
chr6   12     0                    12           0             0
chr7   17     0                    17           0             0
chr8   7      0                    7            0             0
chr9   16     0                    16           0             0
chr10  10     0                    10           0             0
chr11  15     0                    15           0             0
chr12  12     0                    12           0             0
chr13  9      0                    9            0             0
chr14  5      0                    5            0             0
chr15  14     9                    14           10            8
chr16  15     15                   15           15            15
chr17  12     12                   12           12            12
chr18  2      2                    2            2             2
chr19  22     22                   22           22            22
chr20  4      4                    4            4             4
chr21  5      5                    5            5             5
chr22  13     13                   13           13            13
chrX   14     14                   14           14            14
As you can see Chr1 is totally missing and the results are variable up to Chr15 and everything beyond that is okay.
I also did a manual cross-check of whether the 'ref_context' oligo could be found in the "Annotation_Transcript". The ref_context oligo was NOT found in the Annotation_Transcript nt sequence (downloaded from Ensembl) in 194 of 300 SITEs. When it was found it wasn't always at the 10 nt offset (sometimes it was). I did not check the Refseq_mRNA.
I did a bit of manual xchecking of the .vcf generated by mutect2 and that seems okay, matching what is in the funcotator .maf.
I am not at all sure how to proceed? I can send the problem files.
Notes:
1) I am using the :latest docker version of GATK, running on Docker Desktop on a high-end Windows 10 workstation with 128GB RAM
2)I don't know whether there can be some kind of timing issue at google. Funcotator connects to google and there is a lot of traffic for an extended period on our slow DSL line in this rural area. I usually send it to run overnight. I don't know how it works, i.e. whether the entire mutect .vcf is pushed to google and then the output trickles back over time??
:-) we use anonymal to anonymize the data (random adjective + random animal) This is not a goat sequence!
## GATKCommandLine=<ID=Funcotator,CommandLine="Funcotator --output mydata/GBM_00067_NiceGoat/analysis/GBM_00067-92007_DT_NiceGoat_mutect2_funcotator_hg38_1.7.maf --ref-version hg38 --data-sources-path mydata/dataSourcesFolder/funcotator_dataSources.v1.7.20200521s/ --output-file-format MAF --variant mydata/GBM_00067_NiceGoat/analysis/GBM_00067-92007_DT_NiceGoat_mutect2_filtered_hg38.vcf --reference mydata/refs/Homo_sapiens_assembly38.fasta --verbosity ERROR --remove-filtered-variants false --five-prime-flank-size 5000 --three-prime-flank-size 0 --force-b37-to-hg19-reference-contig-conversion false --transcript-selection-mode CANONICAL --lookahead-cache-bp 100000 --min-num-bases-for-segment-funcotation 150 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.2.0.0",Date="June 2, 2021 7:17:41 PM GMT
Dear GATK team
I ran GATK4 Mutect2 with "--tumor-lod-to-emit -10" and "--bam-output". When checking the BAM file, I noticed that some of the mismatches (variants) were not found in ArtificialHaplotypeRG. As far as I know, the BAM file from "--bam-output" is composed of two types of reads: the non-ArtificialHaplotypeRG reads, which come from the raw read data, and the ArtificialHaplotypeRG reads, which summarize the non-ArtificialHaplotypeRG reads. I could not understand why some of the mutations in the non-ArtificialHaplotypeRG reads failed to appear in ArtificialHaplotypeRG and were not recorded in the VCF file. I would appreciate it if you could tell me the reason, or the options that would allow me to incorporate these dropped variants into the VCF.
Best regards,
Hello, I use the 'Scatter Gather' mode and run Mutect2 in parallel on separate chromosomes, but the results are not consistent with those obtained by running without this mode. How can I ensure consistent results when using the 'Scatter Gather' mode?
https://github.com/broadinstitute/gatk/issues/8152