1. For GenomicsDBImport, what options do I include in -L (or --intervals) for whole genome contiguous datasets?
The GenomicsDBImport -L (or --intervals) option takes one or more genomic intervals over which to operate. For example, to use 20 bp of padding around the locus at chr1:100, you could use -L 1:80-120. This is typically used to add padding around targets when analyzing exomes. If you have many intervals to pad, a less cumbersome way to do this is -L 1:100 -ip 20 (“ip” stands for “interval padding” and is an argument of all GATK tools).
More info is available at the GenomicsDBImport Tool Index.
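The padding arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of what -ip does to a single interval: the helper name pad_interval is hypothetical and not part of GATK, and clamping the start at position 1 is an assumption (the sketch does not clamp at the contig end).

```python
def pad_interval(contig: str, start: int, end: int, padding: int) -> str:
    """Return a 'contig:start-end' string widened by `padding` bases on each side."""
    # Hypothetical helper for illustration only; GATK applies -ip internally.
    padded_start = max(1, start - padding)  # assumption: 1-based coordinates, so clamp at 1
    padded_end = end + padding              # not clamped to contig length in this sketch
    return f"{contig}:{padded_start}-{padded_end}"


print(pad_interval("1", 100, 100, 20))  # 1:80-120
```

Here -L 1:100 -ip 20 corresponds to pad_interval("1", 100, 100, 20), which yields the same interval as writing -L 1:80-120 by hand.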
2. How does Mutect2 determine whether a given call is somatic or germline, if it's not in the included Resource file?
The germline resource is used to get the frequency of a variant allele in the population, thereby providing the prior probability that the sample carries the allele in the germline. This prior is one ingredient in a statistical model for germline variation. When an allele is missing from the germline resource, Mutect2 uses the same model with a very small imputed allele frequency.
This is described in more detail in Section II E of the Mutect2 documentation.
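To make the idea of "allele frequency as a prior" concrete, here is a minimal sketch that converts a population allele frequency into the prior probability of carrying the allele in the germline. It assumes Hardy-Weinberg proportions and is not Mutect2's full statistical model; the function name and the imputed frequency of 1e-6 for alleles missing from the resource are purely illustrative assumptions.

```python
def germline_carrier_prior(allele_frequency: float) -> float:
    """Prior probability that a diploid sample carries the allele (het or hom-alt).

    Assumes Hardy-Weinberg proportions; this is an illustration, not Mutect2's code.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    het = 2.0 * allele_frequency * (1.0 - allele_frequency)  # one copy of the allele
    hom_alt = allele_frequency * allele_frequency            # two copies of the allele
    return het + hom_alt


common = germline_carrier_prior(0.01)   # allele present in the germline resource
missing = germline_carrier_prior(1e-6)  # hypothetical small imputed frequency
print(common > missing)  # True
```

The point of the sketch is only that an allele absent from the resource gets a much smaller germline prior than a common one, so the remaining evidence has to do more of the work.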
3. Does the size of an INDEL variant affect the VAF field?
No, the size of an INDEL variant should not have any effect on the VAF field.
4. In Mutect2 (Tumor-only mode), is it possible to distinguish between likely germline and somatic mutations, while keeping both?
The tool does its best, but this is essentially impossible. Any genome will have several tens of thousands of germline variants so rare that they don’t appear in gnomAD, and there is hardly any way to distinguish these from somatic variants. The exception is if, due to low purity or other reasons, most somatic mutations in a sample have low allele frequencies. In such cases, FilterMutectCalls can distinguish them from germline hets.
5. How is the Mutect2 filter different in tumor-only mode, versus in matched-normal mode? What does it do differently in each case?
In tumor-normal mode, Mutect2 detects germline variants using (1) the population allele frequency from the germline resource as a prior, (2) the normal reads, and (3) the allele fraction in the tumor (allele fractions near ½ are suggestive of germline hets). In tumor-only mode, the evidence from normal reads is missing, but it’s the same model. Additionally, in tumor-only mode, the powerful normal artifact filter is not available.
6. Can I redefine what constitutes a Tumor/Normal sample while variant calling, for the purposes of analysis? For example, using a primary tumor sample as "normal" and a relapsed tumor as "tumor"?
Yes, this is possible.
7. Can I change my TLOD (Tumor LOD) or NLOD (Normal LOD) filtering thresholds?
FilterMutectCalls has not filtered with thresholds on TLOD and NLOD for quite a long time. See Sections II C and II E of the Mutect2 documentation for details. There is currently a single parameter for adjusting sensitivity vs. specificity, -f-score-beta. See Section II A of the documentation for details.
8. How does Mutect2 multi-sample mode work?
In Mutect2's multi-sample mode, normal reads are pooled in memory. Even though the input BAMs are not merged, Mutect2's variant calling treats them as if they were. (The only sign that they came from different BAMs is that they have distinct genotype fields in the output VCF.) Tumor BAMs and tumor reads are not merged, but Mutect2 uses all reads at once in its local assembly. Mutect2 also genotypes all tumor samples jointly, which means that they share statistical power. For that reason, multi-sample mode is especially useful when there are several samples with low coverage or low variant allele fractions.
The output is a single VCF with one genotype field for each sample.
Both Mutect2 and FilterMutectCalls make only a single call for each variant (i.e., a variant reported as PASS by FilterMutectCalls means that the evidence of all samples taken together suggests that it is a real somatic variant). A PASS call means that the variant is real and present in at least one tumor sample.
9. How can I increase the sensitivity of my somatic variant calling?
In FilterMutectCalls you can increase the -f-score-beta parameter from its default of 1 to increase sensitivity at the expense of precision.
More information is available in the Mutect2 Tool Index.
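The effect of raising beta can be illustrated with the standard F-beta score, which weights recall (sensitivity) beta times as heavily as precision. The sketch below only computes that textbook formula; the two operating points are made-up numbers, not Mutect2 results, and it does not reproduce how FilterMutectCalls picks its threshold internally.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Two hypothetical filtering thresholds (illustrative numbers only).
strict = {"precision": 0.98, "recall": 0.80}   # fewer false positives, misses more
lenient = {"precision": 0.80, "recall": 0.92}  # more sensitive, less precise

for beta in (1.0, 2.0):
    best = max((strict, lenient), key=lambda p: f_beta(p["precision"], p["recall"], beta))
    print(beta, "strict" if best is strict else "lenient")
```

With beta = 1 the strict threshold scores higher; with beta = 2 the lenient one does, which is the trade-off of sensitivity against precision described above.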
10. How much RAM should I be allocating to run Mutect2? How many CPUs should I use?
You are discouraged from running Mutect2 using multithreaded options, and for most use-cases only 1 CPU should be enough. About 4GB of RAM should be sufficient for most simple exome sequencing samples.
To increase the speed of a Mutect2 workflow, consider running multiple instances of Mutect2 with the -L (or --intervals) option, each with an included BED file of genomic regions. A target list of 10,000 regions could then be broken up into groups of 100 regions and run in parallel. In the Mutect2 WDL this is achieved by setting the scatter_count parameter. We strongly recommend using the WDL / Terra for this, because after scattering many files must be merged before filtering, and the WDL handles this automatically.
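As a rough sketch of the scattering idea, the snippet below splits a list of BED-style regions into groups of 100, each of which could be written to its own BED file and passed to a separate Mutect2 run. The function name and the synthetic regions are hypothetical, and the WDL may split intervals differently; this only shows the chunking arithmetic.

```python
from typing import List


def chunk_regions(regions: List[str], group_size: int) -> List[List[str]]:
    """Split a list of regions into consecutive groups of at most `group_size`."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [regions[i:i + group_size] for i in range(0, len(regions), group_size)]


# Synthetic BED lines standing in for a real target list (illustrative only).
regions = [f"chr1\t{i * 1000}\t{i * 1000 + 200}" for i in range(10_000)]
groups = chunk_regions(regions, 100)
print(len(groups), len(groups[0]))  # 100 100
```

Each group corresponds to one parallel job; the per-job outputs still have to be merged before filtering, which is the step the WDL automates.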
11. Is there a difference between the ways that Mutect2 and GATK4 identify active regions?
Yes. Mutect2 uses a quick approximation of the somatic likelihoods model for evaluating whether a pileup exhibits somatic variation, while HaplotypeCaller uses a germline genotyping model.
12. Why does Mutect2 only annotate mutations? Why doesn't it actually filter out mutations that are likely false positives (i.e. not marked as PASS)?
The FILTER column as used within Mutect2 is part of the VCF spec. As such, GATK tools (and many other software packages) know that anything with a value other than PASS is not a true variant. As far as the spec is concerned, adding to the FILTER column is filtering. A downstream tool that does not recognize this is faulty.
Funcotator has the --remove-filtered-variants argument to omit non-PASS calls from its output, which may be useful if you want to remove likely false positives.
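For a downstream script that needs only passing calls, a minimal sketch of FILTER-aware parsing might look like the following. It relies only on the VCF convention that FILTER is the seventh tab-separated column; the function name and the toy records are illustrative, and a real pipeline would more likely use an established VCF library or tool.

```python
from typing import Iterable, List


def keep_pass_records(vcf_lines: Iterable[str]) -> List[str]:
    """Keep header lines and records whose FILTER column (7th field) is exactly PASS."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header and column lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            kept.append(line)
    return kept


# Toy records for illustration; not real Mutect2 output.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tweak_evidence\t.",
    "chr1\t300\t.\tC\tCA\t.\tgermline;weak_evidence\t.",
]
print(len(keep_pass_records(toy_vcf)))  # 3
```

The two header lines and the single PASS record survive; the records carrying filter names are dropped.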
13. What should I do when I have a high number of transversions after running Mutect2 and FilterMutectCalls on WES data? Could it be due to OxoG artifacts?
You should use the orientation bias artifact filter. See Section II G of the Mutect2 documentation for the necessary commands — LearnReadOrientationModel and FilterMutectCalls.
In the Mutect2 WDL/Terra workflow it is as simple as setting
14. What types of variants does Mutect2 call?
Mutect2 calls only short variants, including SNAs (single nucleotide alterations) and indels; filtering is then applied to remove artifacts while keeping true variants.
If you want to find structural variants, you can look at this tool documentation.
If you have specific variants you do not want in your file, you can use the -XL option to manually exclude sites.
15. Is the Mutect2 Panel of Normals (PON) used to filter variants?
Mutect2 marks variants that are found in the PON with the “PON” info field, which FilterMutectCalls then uses for filtering. Additionally, Mutect2 considers variants in the PON as inactive by default (this can be changed with the -genotype-pon-sites argument), so most will be silently pre-filtered without ending up in the output. Some PON sites are output because they may appear in an active region containing a non-PON site.
16. Why am I seeing unexpected variants in my dataset?
17. How do I create a Panel of Normals (PON) in cohort mode?
Cohort mode is for the CNV panel of normals, which is distinct from the Mutect2 panel of normals. This is not a Mutect2 question.
18. What do --min-base-quality-score and --f1r2-min-bq mean? How are they different?
--min-base-quality-score is the minimum base quality for a base to be used in a kmer for assembly. For example, if the kmer size is 3 (obviously unrealistic) and we have a read with bases ATCGATTC, where every base except the G has sufficient quality, we will put the kmers ATC, ATT, and TTC on the graph but we will exclude TCG, CGA, and GAT.
--f1r2-min-bq is the minimum base quality to be used for collecting orientation bias statistics when the -f1r2-tar-gz option is turned on.
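The kmer example for --min-base-quality-score above can be reproduced with a short sketch. It is a simplified illustration, not GATK's assembly code: the function name is hypothetical, and the quality values and the threshold of 10 are chosen only so that the G is the single low-quality base.

```python
from typing import List


def usable_kmers(bases: str, quals: List[int], k: int, min_bq: int) -> List[str]:
    """Return the kmers in which every base has quality >= min_bq."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    kmers = []
    for i in range(len(bases) - k + 1):
        # A kmer is kept only if none of its bases falls below the threshold.
        if all(q >= min_bq for q in quals[i:i + k]):
            kmers.append(bases[i:i + k])
    return kmers


bases = "ATCGATTC"
quals = [30, 30, 30, 5, 30, 30, 30, 30]  # only the G (4th base) is low quality
print(usable_kmers(bases, quals, k=3, min_bq=10))  # ['ATC', 'ATT', 'TTC']
```

The three kmers that overlap the low-quality G (TCG, CGA, and GAT) are dropped, matching the description above.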
19. What is the difference between the panel files in the Google Cloud Best Practices folder?
There are several different panel files located in the Google Cloud Best Practices folder.
"Mutect2-WGS-panel-b37.vcf" is a whole-genome panel, and "Mutect2-exome-panel.vcf" is an exome panel, both of which are generated from several hundred normals sequenced with standard Broad Genomics Platform protocols. Most errors caught by the PON are mapping artifacts, so these are still useful, despite changes in sequencing technology.
"1000g_pon.hg38.vcf" is an hg38 panel of normals for both exomes and whole genomes generated from 1000 Genomes Project samples.
Finally, "af-only-gnomad.hg38.vcf" is a copy of the gnomAD VCF stripped of all unnecessary INFO fields. It is used for the
20. Can I find common variants between ctDNA and primary tumour samples by performing HaplotypeCaller JointGenotyping?
No, it is not possible to do this because HaplotypeCaller genotypes diploid variants. Even if your primary tumor is pure, monoclonal, and has no CNVs, the low allele fractions in the ctDNA sample will confuse it.
Mutect2 is a better option for this. You can run it in multi-sample mode by specifying the -I argument once for each sample.
21. Can I find specific mutations in ctDNA by using Mutect2 on primary tumour samples as if they were a matched normal?
Yes, this is possible.
If you have more specific questions about any of our tools that are not covered in this FAQ or in our documentation, please direct your questions to our GATK forum.
In the meantime, below you can find links to some of our most popular tutorials and articles related to Mutect2.