This page explains what Funcotator is and how to run it.
Table of Contents
- Funcotator Background Information
• 1.1 Data Sources
◦ 1.1.1 Data Source Folders
◦ 1.1.2 Pre-Packaged Data Sources
☞ 1.1.2.1 Downloading Pre-Packaged Data Sources
☞ 1.1.2.2 gnomAD
☞ 1.1.2.2.1 Enabling gnomAD
☞ 1.1.2.2.2 Included gnomAD Fields
◦ 1.1.3 Data Source Downloader Tool
◦ 1.1.4 Disabling Data Sources
◦ 1.1.5 User-Defined Data Sources
☞ 1.1.5.1 Configuration File Format
☞ 1.1.5.1.1 Simple XSV Config File Example
☞ 1.1.5.1.2 Locatable XSV Config File Example
☞ 1.1.5.2 Cloud Data Sources
◦ 1.1.6 Data Source Versioning
• 1.2 Input Variant Data Formats
• 1.3 Output
◦ 1.3.1 Output Data Formats
☞ 1.3.1.1 VCF Format
☞ 1.3.1.2 MAF Format
◦ 1.3.2 Annotations for Pre-Packaged Data Sources
☞ 1.3.2.1 Gencode Annotation Specification
• 1.4 Reference Genome Versions
• 1.5 Comparisons with Oncotator
◦ 1.5.1 Funcotator / Oncotator Feature Comparison
◦ 1.5.2 Oncotator Bugs Compared With Funcotator
- Tutorial
• 2.0 Requirements
• 2.1 Running Funcotator in the GATK With Base Options
• 2.2 Optional Parameters
◦ 2.2.1 --ignore-filtered-variants
◦ 2.2.2 --transcript-selection-mode
◦ 2.2.3 --transcript-list
◦ 2.2.4 --annotation-default
◦ 2.2.5 --annotation-override
◦ 2.2.6 --allow-hg19-gencode-b37-contig-matching
- FAQ
- Known Issues
- Github
- Tool Documentation
1 - Funcotator Background Information
Funcotator (FUNCtional annOTATOR) analyzes given variants for their function (as retrieved from a set of data sources) and produces the analysis in a specified output file.
This tool allows a user to add their own annotations to variants based on a set of data sources. Each data source can be customized to annotate a variant based on several matching criteria. This allows a user to create their own custom annotations easily, without modifying any Java code.
An example Funcotator workflow based on the GATK Best Practices Somatic Pipeline is as follows:
1.1 - Data Sources
Data sources are expected to be in folders that are specified as input arguments. While multiple data source folders can be specified, no two data sources can have the same name.
Please note that annotations like dbNSFP, SIFT, and PolyPhen (which used to be available within Oncotator) are no longer available. The data resources that Oncotator and Funcotator use are different, and so these annotations are not natively supported.
1.1.1 - Data Source Folders
In each main data source folder, there should be sub-directories for each individual data source, with further sub-directories for a specific reference (e.g. hg19, hg38, etc.). In the reference-specific data source directory, there is a configuration file detailing information about the data source and how to match it to a variant. This configuration file is required.
An example of a data source directory is the following:
dataSourcesFolder/
    Data_Source_1/
        hg19
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
        hg38
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
    Data_Source_2/
        hg19
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
        hg38
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
    ...
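The folder layout above can be checked with a short script. The following is a minimal Python sketch, not part of Funcotator: the function name `find_data_sources` is hypothetical, and it only assumes the `<data sources folder>/<data source>/<reference>/<name>.config` layout described in this section.

```python
from pathlib import Path


def find_data_sources(data_sources_dir, ref_version):
    """Return {data source folder name: config file path} for one reference.

    Hypothetical helper: it assumes the layout
    <data_sources_dir>/<Data_Source>/<ref_version>/<something>.config
    and does not validate the contents of the config file.
    """
    found = {}
    for source_dir in sorted(Path(data_sources_dir).iterdir()):
        if not source_dir.is_dir():
            continue  # skip stray files such as downloaded archives
        ref_dir = source_dir / ref_version
        if not ref_dir.is_dir():
            continue  # this data source has no folder for the requested reference
        configs = sorted(ref_dir.glob("*.config"))
        if configs:
            found[source_dir.name] = configs[0]
    return found
```

Calling `find_data_sources("dataSourcesFolder", "hg19")` on the example tree would list `Data_Source_1` and `Data_Source_2`, each mapped to its hg19 configuration file.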
1.1.2 - Pre-Packaged Data Sources
The GATK includes two sets of pre-packaged data sources, allowing for Funcotator use without (much) additional configuration. These data source packages correspond to the germline and somatic use cases. Broadly speaking, if you have a germline VCF, the germline data sources are what you want to use to start with. Conversely, if you have a somatic VCF, the somatic data sources are what you want to use to start with.
1.1.2.1 - Downloading Pre-Packaged Data Sources
Versioned gzip archives of data source files are provided here:
- FTP: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/
- Google Cloud Bucket: gs://broad-public-datasets/funcotator/
1.1.2.2 - gnomAD
The pre-packaged data sources include a subset of gnomAD, a large database of known variants. This subset contains a greatly reduced subset of INFO fields, primarily containing allele frequency data. gnomAD is split into two parts - one based on exome data, one based on whole genome data. These two data sources are not equivalent and for complete coverage using gnomAD, we recommend annotating with both.
Due to the size of gnomAD, it cannot be included in the data sources package directly. Instead, the configuration data are present and point to a Google bucket in which the gnomAD data reside. This will cause Funcotator to actively connect to that bucket when it is run.
For this reason, gnomAD is disabled by default.
Because Funcotator will query the Internet when gnomAD is enabled, performance will be impacted by the machine's Internet connection speed. If this degradation is significant, you can localize gnomAD to the machine running Funcotator to improve performance, although the size of gnomAD may make this impractical.
1.1.2.2.1 - Enabling gnomAD
To enable gnomAD, simply change directories to your data sources directory and untar the gnomAD tar.gz files:
cd DATA_SOURCES_DIR
tar -zxf gnomAD_exome.tar.gz
tar -zxf gnomAD_genome.tar.gz
1.1.2.2.2 - Included gnomAD Fields
The fields included in the pre-packaged gnomAD subset are the following:
Field Name | Field Description |
---|---|
AF | Allele Frequency, for each ALT allele, in the same order as listed |
AF_afr | Alternate allele frequency in samples of African-American ancestry |
AF_afr_female | Alternate allele frequency in female samples of African-American ancestry |
AF_afr_male | Alternate allele frequency in male samples of African-American ancestry |
AF_amr | Alternate allele frequency in samples of Latino ancestry |
AF_amr_female | Alternate allele frequency in female samples of Latino ancestry |
AF_amr_male | Alternate allele frequency in male samples of Latino ancestry |
AF_asj | Alternate allele frequency in samples of Ashkenazi Jewish ancestry |
AF_asj_female | Alternate allele frequency in female samples of Ashkenazi Jewish ancestry |
AF_asj_male | Alternate allele frequency in male samples of Ashkenazi Jewish ancestry |
AF_eas | Alternate allele frequency in samples of East Asian ancestry |
AF_eas_female | Alternate allele frequency in female samples of East Asian ancestry |
AF_eas_jpn | Alternate allele frequency in samples of Japanese ancestry |
AF_eas_kor | Alternate allele frequency in samples of Korean ancestry |
AF_eas_male | Alternate allele frequency in male samples of East Asian ancestry |
AF_eas_oea | Alternate allele frequency in samples of non-Korean, non-Japanese East Asian ancestry |
AF_female | Alternate allele frequency in female samples |
AF_fin | Alternate allele frequency in samples of Finnish ancestry |
AF_fin_female | Alternate allele frequency in female samples of Finnish ancestry |
AF_fin_male | Alternate allele frequency in male samples of Finnish ancestry |
AF_male | Alternate allele frequency in male samples |
AF_nfe | Alternate allele frequency in samples of non-Finnish European ancestry |
AF_nfe_bgr | Alternate allele frequency in samples of Bulgarian ancestry |
AF_nfe_est | Alternate allele frequency in samples of Estonian ancestry |
AF_nfe_female | Alternate allele frequency in female samples of non-Finnish European ancestry |
AF_nfe_male | Alternate allele frequency in male samples of non-Finnish European ancestry |
AF_nfe_nwe | Alternate allele frequency in samples of North-Western European ancestry |
AF_nfe_onf | Alternate allele frequency in samples of non-Finnish but otherwise indeterminate European ancestry |
AF_nfe_seu | Alternate allele frequency in samples of Southern European ancestry |
AF_nfe_swe | Alternate allele frequency in samples of Swedish ancestry |
AF_oth | Alternate allele frequency in samples of uncertain ancestry |
AF_oth_female | Alternate allele frequency in female samples of uncertain ancestry |
AF_oth_male | Alternate allele frequency in male samples of uncertain ancestry |
AF_popmax | Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry) |
AF_raw | Alternate allele frequency in samples, before removing low-confidence genotypes |
AF_sas | Alternate allele frequency in samples of South Asian ancestry |
AF_sas_female | Alternate allele frequency in female samples of South Asian ancestry |
AF_sas_male | Alternate allele frequency in male samples of South Asian ancestry |
OriginalAlleles* | A list of the original alleles (including REF) of the variant prior to liftover. If the alleles were not changed during liftover, this attribute will be omitted. |
OriginalContig* | The name of the source contig/chromosome prior to liftover. |
OriginalStart* | The position of the variant on the source contig prior to liftover. |
ReverseComplementedAlleles* | The REF and the ALT alleles have been reverse complemented in liftover since the mapping from the previous reference to the current one was on the negative strand. |
SwappedAlleles* | The REF and the ALT alleles have been swapped in liftover due to changes in the reference. It is possible that not all INFO annotations reflect this swap, and in the genotypes, only the GT, PL, and AD fields have been modified. You should check the TAGS_TO_REVERSE parameter that was used during the LiftOver to be sure. |
* - only available in hg38
1.1.3 - Data Source Downloader Tool
To improve ease-of-use of Funcotator, there is a tool to download the pre-packaged data sources to the user's machine. This tool is the FuncotatorDataSourceDownloader and can be run to retrieve the pre-packaged data sources from the Google bucket and localize them to the machine on which it is run. Briefly:
For somatic data sources:
./gatk FuncotatorDataSourceDownloader --somatic --validate-integrity --extract-after-download
For germline data sources:
./gatk FuncotatorDataSourceDownloader --germline --validate-integrity --extract-after-download
1.1.4 - Disabling Data Sources
A data source can be disabled by removing the folder containing the configuration file for that source. This can be done on a per-reference basis. If the entire data source should be disabled, the entire top-level data source folder can be removed.
1.1.5 - User-Defined Data Sources
Users can define their own data sources by creating a new correctly-formatted data source sub-directory in the main data sources folder. In this sub-directory, the user must create an additional folder for the reference for which the data source is valid. If the data source is valid for multiple references, then multiple reference folders should be created. Inside each reference folder, the user should place the file(s) containing the data for the data source. Additionally the user must create a configuration file containing metadata about the data source.
There are several formats allowed for data sources:
Data Format Class | Data Source Description |
---|---|
simpleXSV | Separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID |
locatableXSV | Separated value table (e.g. CSV), keyed off a genome location |
gencode | Class for GENCODE data files (gtf format) |
cosmic | Class for COSMIC data |
vcf | Class for Variant Call Format (VCF) files |
Two of the most useful are arbitrarily separated value (XSV) files, such as comma-separated value (CSV) and tab-separated value (TSV) files. These files contain a table of data that can be matched to a variant by gene name, transcript ID, or genome position. In the case of gene name and transcript ID, one column must contain the gene name or transcript ID for each row's data.
- For gene name, when a variant is annotated with a gene name that exactly matches an entry in the gene name column for a row, that row's other fields will be added as annotations to the variant.
- For transcript ID, when a variant is annotated with a transcript ID that exactly matches an entry in the transcript ID column for a row, that row's other fields will be added as annotations to the variant.
- For genome position, one column must contain the contig ID, another column must contain the start position (1-based, inclusive), and a column must contain the stop position (1-based, inclusive). The start and stop columns may be the same column. When a variant is annotated with a genome position that overlaps an entry in the three genome position columns for a row, that row's other fields will be added as annotations to the variant.
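The three matching rules above can be sketched in a few lines of Python. This is an illustration of the rules as described here, not Funcotator's implementation; the function names and the sample table used to exercise them are hypothetical.

```python
import csv
import io


def read_xsv(text, delimiter="\t"):
    """Parse delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def match_by_key(rows, key_column, key):
    """simpleXSV-style lookup: rows whose key column exactly equals `key`.

    `key` is the gene name or transcript ID already assigned to the variant.
    """
    return [row for row in rows if row[key_column] == key]


def match_by_location(rows, contig_col, start_col, end_col, contig, start, end):
    """locatableXSV-style lookup: rows overlapping [start, end] on `contig`.

    All positions are 1-based and inclusive, so two intervals overlap when
    each one starts at or before the position where the other one ends.
    """
    hits = []
    for row in rows:
        row_start, row_end = int(row[start_col]), int(row[end_col])
        if row[contig_col] == contig and row_start <= end and start <= row_end:
            hits.append(row)
    return hits
```

In both cases the remaining columns of each returned row are the values that would be added to the variant as annotations. Because start and stop are both inclusive, a single-base entry uses the same value for both, which is why the start and stop columns may be the same column.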
1.1.5.1 - Configuration File Format
The configuration file is a standard Java properties-style configuration file with key-value pairs. This file name must end in .config.
The following is an example of a Simple XSV configuration file (for the Familial Cancer Genes data source):
name = Familial_Cancer_Genes
version = 20110905
src_file = Familial_Cancer_Genes.no_dupes.tsv
origin_location = oncotator_v1_ds_April052016.tar.gz
preprocessing_script = UNKNOWN
# Whether this data source is for the b37 reference.
# Required and defaults to false.
isB37DataSource = false
# Supported types:
# simpleXSV -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode -- Custom datasource class for GENCODE
# cosmic -- Custom datasource class for COSMIC
# vcf -- Custom datasource class for Variant Call Format (VCF) files
type = simpleXSV
# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =
# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version =
# Required field for simpleXSV files.
# Valid values:
# GENE_NAME
# TRANSCRIPT_ID
xsv_key = GENE_NAME
# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column = 2
# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t
# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
# true
# false
xsv_permissive_cols = true
# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the contig for each row
contig_column =
# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the start position for each row
start_column =
# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the end position for each row
end_column =
The following is an example of a Locatable XSV configuration file (for the ORegAnno data source):
name = Oreganno
version = 20160119
src_file = oreganno.tsv
origin_location = http://www.oreganno.org/dump/ORegAnno_Combined_2016.01.19.tsv
preprocessing_script = getOreganno.py
# Whether this data source is for the b37 reference.
# Required and defaults to false.
isB37DataSource = false
# Supported types:
# simpleXSV -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode -- Custom datasource class for GENCODE
# cosmic -- Custom datasource class for COSMIC
# vcf -- Custom datasource class for Variant Call Format (VCF) files
type = locatableXSV
# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =
# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version =
# Required field for simpleXSV files.
# Valid values:
# GENE_NAME
# TRANSCRIPT_ID
xsv_key =
# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column =
# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t
# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
# true
# false
xsv_permissive_cols = true
# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the contig for each row
contig_column = 1
# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the start position for each row
start_column = 2
# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the end position for each row
end_column = 3
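When writing a configuration file for your own data source, it can help to check it before running Funcotator. The sketch below is a simplified Python reader for `key = value` files like the two examples; it is not Funcotator's parser and ignores parts of the Java properties format such as line continuations. The function names are hypothetical, and the list of required keys is taken only from the comments in the examples.

```python
def parse_config(text):
    """Parse `key = value` lines, skipping blank lines and # comments.

    Simplified assumption: every property fits on one line and uses `=`.
    Values are stripped of surrounding whitespace, so a delimiter must be
    written as an escape such as \\t rather than as a literal tab.
    """
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def missing_locatable_fields(properties):
    """Return the locatableXSV keys (per the example comments) that are empty."""
    required = ["xsv_delimiter", "contig_column", "start_column", "end_column"]
    return [key for key in required if not properties.get(key)]
```

For the ORegAnno example, `missing_locatable_fields` returns an empty list, while for the Familial Cancer Genes example (a simpleXSV source) it reports the three unset column keys.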
1.1.5.2 - Cloud Data Sources
Funcotator allows for data sources with source files that live on the cloud, enabling users to annotate with data sources that are not physically present on the machines running Funcotator. To create a data source based on the cloud, create a configuration file for that data source and put the cloud URL in as the src_file property (see Configuration File Format for details). E.g.:
...
src_file = gs://gcp-public-data--broad-references/hg19/v0/1000G_phase1.snps.high_confidence.b37.vcf.gz
...
1.1.6 - Data Source Versioning
Each release of the data sources contains a version number. Newer versions of Funcotator require minimum versions of data sources in order to run. If a new version of Funcotator is run with an older version of the data sources, an error will be thrown prompting the user to download a new release of the data sources.
Similarly, newer releases of the data source packages are not backward compatible with older versions of Funcotator. However, in this case Funcotator may or may not throw an error or warning.
To ensure compatibility when upgrading Funcotator, always download the latest data sources release. Similarly, when updating data sources make sure to update Funcotator to the latest version.
1.2 - Input Variant Data Formats
Currently Funcotator can only accept input variants in the form of a VCF file.
1.3 - Output
1.3.1 - Output Data Formats
Funcotator supports output in both VCF format and MAF format.
1.3.1.1 - VCF Output
VCF files will contain the annotations for each variant allele as part of a custom INFO tag, FUNCOTATION. This custom tag will contain a pipe-separated (|) list of annotations for each alternate allele on a given line of the VCF. The VCF header will contain an INFO field comment line for the FUNCOTATION data describing the field name for each value in the pipe-separated list.
##INFO=<ID=FUNCOTATION,Number=A,Type=String,Description="Functional annotation from the Funcotator tool. Funcotation fields are: dbSNP_Val_Status|Center">
For example:
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO
chr19   8914955  .   C    A    40    .       FUNCOTATION=No Value|broad.mit.edu
In this example, the variant has one alternate allele (A) with two fields (dbSNP_Val_Status and Center). The values of the fields are:
Field Name | Field Value |
---|---|
dbSNP_Val_Status | No Value |
Center | broad.mit.edu |
For variants with multiple alternate alleles, the INFO field will contain multiple lists of annotations (each list separated by a comma), the order of which corresponds to the alternate allele being annotated. For example:
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO
chr7    273846  .   C    A,G  40    .       FUNCOTATION=No Value|broad.mit.edu,Big Value Here|brandeis.edu
In this example, the variant has two alternate alleles (A and G), each with two fields (dbSNP_Val_Status and Center). The values of the fields are:
Alternate Allele | Field Name | Field Value |
---|---|---|
A | dbSNP_Val_Status | No Value |
A | Center | broad.mit.edu |
G | dbSNP_Val_Status | Big Value Here |
G | Center | brandeis.edu |
This formatting is the result of limitations in the VCF file specification.
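To read these annotations back out of a VCF, the field names from the header line have to be paired with the pipe-separated values for each alternate allele. The following Python sketch does this for the simplified examples in this section. It is an assumption-laden illustration: the function names are hypothetical, it splits naively on commas and pipes (so it would mishandle values that themselves contain those characters, or several annotation lists for one allele), and it strips surrounding square brackets only as a defensive measure.

```python
import re


def funcotation_field_names(info_header_line):
    """Extract the field names from the FUNCOTATION INFO header description."""
    match = re.search(r'Funcotation fields are: ([^"]*)"', info_header_line)
    if match is None:
        raise ValueError("no Funcotation field list found in header line")
    return match.group(1).strip().split("|")


def parse_funcotation(value, field_names, alt_alleles):
    """Map each ALT allele to a {field name: value} dict.

    `value` is the text after `FUNCOTATION=`; `alt_alleles` is the ALT column
    split on commas, in the same order as the annotation lists.
    """
    per_allele = value.split(",")
    if len(per_allele) != len(alt_alleles):
        raise ValueError("expected one annotation list per alternate allele")
    parsed = {}
    for allele, annotations in zip(alt_alleles, per_allele):
        # Defensive assumption: tolerate a list wrapped in square brackets.
        values = annotations.strip("[]").split("|")
        parsed[allele] = dict(zip(field_names, values))
    return parsed
```

Applied to the two-allele example, this yields `No Value` and `broad.mit.edu` for allele A, and `Big Value Here` and `brandeis.edu` for allele G, matching the table above.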
1.3.1.2 - MAF Output
The MAF format used in Funcotator is an extension of the standard TCGA MAF. It is based on the MAF format specified for Oncotator here, under Output Format. While the actual columns can vary (due to different data sources being used to create annotations), columns 1-67 will generally be the same.
In the case of a variant with multiple alternate alleles, each alternate allele will be written to a separate line in the MAF file.
1.3.2 - Annotations for Pre-Packaged Data Sources
The pre-packaged data sources will create a set of baseline, or default annotations for an input data set. Most of these data sources copy and paste values from their source files into the output of Funcotator to create annotations. In this sense they are trivial data sources.
1.3.2.1 - Gencode Annotation Specification
Funcotator performs some processing on the input data to create the Gencode annotations. Gencode is currently required, so Funcotator will create these annotations for all input variants. See this article for the specification of Gencode annotations in Funcotator.
1.4 - Reference Genome Versions
The two currently supported genomes for annotations out of the box are hg19 and hg38. This is due to the pre-packaged Gencode data sources being for those two references. Any reference genome with published Gencode data sources can be used.
1.4.1 - hg19 vs b37 Reference
The Broad Institute uses an alternate hg19 reference known as b37 for our sequencing. UCSC uses the baseline hg19 reference. These references are similar but not identical.
Due to the Gencode data source being published by UCSC, the data sources all use the hg19 reference for hg19 data (as opposed to b37). Funcotator detects when user data is from the b37 reference and forces the use of the hg19 data sources in this case. The user is warned when this occurs. Generally speaking this is OK, but due to the differences in the sequence data it is possible that some erroneous data will be created.
This effect has not yet been quantified, but in most cases should not be appreciable. For details, see this forum post.
1.5 - Comparisons with Oncotator
Oncotator is an older functional annotation tool developed by The Broad Institute. Funcotator and Oncotator are fundamentally different tools with some similarities.
While I maintain that a direct comparison should not be made, to address some inevitable questions some comparison highlights between Oncotator and Funcotator are in the following two tables:
1.5.1 - Funcotator / Oncotator Feature Comparison
Feature | Funcotator | Oncotator | Notes |
---|---|---|---|
Override values for annotations | Yes | Yes | |
Default values for annotations | Yes | Yes | |
VCF input | Yes | Yes | |
VCF output | Yes | Yes | Annotation format b/w Funcotator and Oncotator differ. |
MAF input | No | Yes | |
MAF output | Yes | Yes | |
TSV/maflite input | No | Yes | |
Simple TSV output | No | Yes | |
Removing datasources does not require developer | Yes | Yes | |
hg38 support | Yes | No | |
Cloud datasources | Yes | No | All data sources supported |
Transcript override list | Yes | Yes | |
Default config speed somatic (muts/min) (hg19) | |||
Default config speed germline (muts/min) (hg19) | A very long time.... | ||
Default config speed somatic (muts/min) (hg38) | N/A | ||
Default config speed germline (muts/min) (hg38) | N/A | ||
Documentation | Tutorial; Specifications forum post; inclusion in workshop materials | Minimal support in forum | |
Manuscript | Planned | Yes | |
HGVS support | No | Yes | |
BigWig datasource support | No | Linux only | |
Seg file input/output | No | Yes | |
Transcript modes: canonical and most deleterious effect | Yes | Yes | |
Transcript mode: ALL | Yes | No | |
Exclude annotations/columns on CLI | Yes | No | |
Automated datasource download tool | Yes | No | |
Automated tool for creating datasources | No | Yes | |
Web application | No | Yes | Uses old version of Oncotator and datasources |
Config file to specify CLI arguments | Yes | No | GATK built-in command line arguments file |
Simple MAF to VCF or VCF to MAF conversion | No | Yes | |
Inferring ONPs | No | Yes (Not recommended) | Mutect2 infers ONPs when calling variants. This is not the job of a functional annotator. |
Ignores filtered input variants | Yes | Yes | |
Mitochondrial amino acid sequence rendering | Yes | No | |
gnomAD annotations | Yes (cloud support) | Not recommended | v2.1 support for hg19; v2.0.2 support for hg38 liftover coming soon; must be manually enabled |
UniProt ID annotations | Yes | Yes | |
Other UniProt annotations (e.g. AAxform) | No | Yes | |
Custom fields: t_alt_count; t_ref_count; etc | MAF Output Only | Yes | |
“other_transcripts” annotation | Yes | Yes | |
Reference context annotations | Yes | Yes | |
COSMIC annotations | Yes | Yes | |
UCSC ID annotations | Yes | Yes | In Funcotator UCSC ID is part of the HGNC data source. |
RefSeq ID annotations | Yes | Yes |
1.5.2 - Oncotator Bugs Compared With Funcotator
Bug | Fixed in Funcotator | Fixed in Oncotator | Notes |
---|---|---|---|
Collapsing ONP counts into one number | N/A | No | |
Variants resulting in protein changes that do not overlap the variant codon itself are not rendered properly | Yes | No | |
Appris ranking not properly sorted | Yes | No | |
Using protein-coding status of gene for sorting (instead of transcript) | Yes | No | |
De Novo Start in UTRs not properly annotated | Yes | No | |
Protein changes for Frame-Shift Insertions on the Negative strand incorrectly rendered | Yes | No | |
MNP End positions incorrectly reported | Yes | No | |
MNPs on the Negative strand have incorrect cDNA/codon/protein changes | Yes | No | |
For Negative strand indels; cDNA string is incorrect | Yes | No | |
Negative strand splice site detection boundary check for indels is incorrect | Yes | No | |
Inconsistent number of bases in reported reference context annotation for indels | Yes | No | |
5’ Flanking variants are reported with an incorrect transcript chosen for Canonical mode | Yes | No | |
Variants overlapping both introns and exons or transcript boundaries are not rendered properly | No | No | Funcotator produces a ‘CANNOT_DETERMINE’ variant classification and minimal populated annotations. |
2 - Tutorial
2.0 - Requirements
- Java 1.8
- A functioning GATK4 jar
- Reference genome (fasta files) with fai and dict files. Human references can be downloaded as part of the GATK resource bundle. Other references can be used but must be provided by the user.
- A local copy of the Funcotator data sources
- A VCF file containing variants to annotate.
2.1 - Running Funcotator in the GATK With Base Options
Open a command line and navigate to your GATK directory.
cd ~/gatk
At this point you should choose your output format. There are two output format choices, one of which must be specified.
Additionally, you must specify a reference version. This reference version is used verbatim to determine which data sources to use for annotations. That is, specifying hg19 will cause Funcotator to look in the <data_sources_dir>/hg19 folder for data sources to use.
A VCF instantiation of the Funcotator tool looks like this:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.vcf \
    --output-file-format VCF
A MAF instantiation of the Funcotator tool looks like this:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF
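If Funcotator is launched from a script rather than typed by hand, the same invocation can be assembled programmatically. The following is a minimal Python sketch that builds the argument list using only the options shown in this section; the function name `build_funcotator_command` is hypothetical, and the default `./gatk` launcher path assumes the script runs from the GATK directory.

```python
def build_funcotator_command(variant, reference, ref_version, data_sources_path,
                             output, output_format, gatk="./gatk"):
    """Return the base Funcotator invocation as a list of arguments."""
    if output_format not in ("VCF", "MAF"):
        raise ValueError("output_format must be 'VCF' or 'MAF'")
    return [
        gatk, "Funcotator",
        "--variant", variant,
        "--reference", reference,
        "--ref-version", ref_version,
        "--data-sources-path", data_sources_path,
        "--output", output,
        "--output-file-format", output_format,
    ]
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.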
2.2 - Optional Parameters
2.2.1 - --ignore-filtered-variants
This flag controls whether Funcotator will annotate filtered variants. By default, this flag is set to true. To annotate filtered variants, run Funcotator with this flag set to false:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF \
    --ignore-filtered-variants false
2.2.2 - --transcript-selection-mode
This parameter determines how the primary annotated transcript is determined. The three modes for this parameter are BEST_EFFECT, CANONICAL, and ALL. By default, Funcotator uses the CANONICAL transcript selection mode.
The explanations and rules governing the transcript selection modes are as follows:
BEST_EFFECT
Select a transcript to be reported with details with priority on effect according to the following list of selection criteria:
- Choose the transcript that is on the custom list specified by the user. If no list was specified, treat as if no transcripts were on the list (tie).
- In case of tie, choose the transcript that yields the variant classification highest on the variant classification rank list (see below).
- If still a tie, choose the transcript with highest level of curation. Note that this means lower number is better for level (see below).
- If still a tie, choose the transcript with the best appris annotation (see below).
- If still a tie, choose the transcript with the longest transcript sequence length.
- If still a tie, choose the first transcript, alphabetically.
CANONICAL
Select a transcript to be reported with details with priority on canonical order according to the following list of selection criteria:
- Choose the transcript that is on the custom list specified by the user. If no list was specified, treat as if all transcripts were on the list (tie).
- In case of tie, choose the transcript with highest level of curation. Note that this means lower number is better for level (see below).
- If still a tie, choose the transcript that yields the variant classification highest on the variant classification rank list (see below).
- If still a tie, choose the transcript with the best appris annotation (see below).
- If still a tie, choose the transcript with the longest transcript sequence length.
- If still a tie, choose the first transcript, alphabetically.
ALL
Same as CANONICAL, but indicates that no transcripts should be dropped. Render all overlapping transcripts.
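The tie-breaking chains above amount to sorting transcripts by a tuple of criteria, where BEST_EFFECT and CANONICAL differ only in whether variant classification or curation level is compared first. The Python sketch below illustrates that idea. It is not Funcotator's code: the `Transcript` class and the numeric `classification_rank` and `appris_rank` fields are hypothetical stand-ins for the rank lists referred to above, with lower numbers meaning better.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    on_user_list: bool        # same value for every transcript when no list is given
    classification_rank: int  # hypothetical: lower = higher on the rank list
    curation_level: int       # lower number = higher level of curation
    appris_rank: int          # hypothetical: lower = better appris annotation
    length: int               # transcript sequence length


def best_effect_key(t):
    """User list, then effect, then curation, appris, length, ID."""
    return (not t.on_user_list, t.classification_rank, t.curation_level,
            t.appris_rank, -t.length, t.transcript_id)


def canonical_key(t):
    """User list, then curation, then effect, appris, length, ID."""
    return (not t.on_user_list, t.curation_level, t.classification_rank,
            t.appris_rank, -t.length, t.transcript_id)


def rank_transcripts(transcripts, mode="CANONICAL"):
    """Return transcripts best-first; ALL uses the CANONICAL ordering."""
    if mode not in ("BEST_EFFECT", "CANONICAL", "ALL"):
        raise ValueError("unknown transcript selection mode: " + mode)
    key = best_effect_key if mode == "BEST_EFFECT" else canonical_key
    return sorted(transcripts, key=key)
```

The first element of the returned list corresponds to the primary transcript. Under ALL, every element would be rendered rather than only the first.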
2.2.3 - --transcript-list
This parameter will restrict the reported/annotated transcripts to only include those on the given list of transcript IDs. This list can be given as the path to a file containing one transcript ID per line OR this parameter can be given multiple times each time specifying a transcript ID.
When specifying transcript IDs, transcript version numbers will be ignored.
Using a manually specified set of transcripts for the transcript list:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF \
    --transcript-list TRANSCRIPT_ID1 \
    --transcript-list TRANSCRIPT_ID2
Using an equivalent transcript file:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF \
    --transcript-list transcriptFile.txt
Contents of transcriptFile.txt:
TRANSCRIPT_ID1
TRANSCRIPT_ID2
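Since transcript version numbers are ignored, an ID with and without a version suffix refers to the same entry on the list. The Python sketch below shows one way to normalize a transcript file accordingly. It assumes the version is a numeric suffix after the final dot (as in Ensembl-style IDs such as `ENST00000000001.4`, a made-up ID); how Funcotator performs this step internally is not stated here, and the function names are hypothetical.

```python
def strip_version(transcript_id):
    """Drop a trailing `.N` version suffix if one is present."""
    base, dot, version = transcript_id.rpartition(".")
    return base if dot and version.isdigit() else transcript_id


def load_transcript_list(lines):
    """Build the set of version-less IDs from one-ID-per-line text."""
    return {strip_version(line.strip()) for line in lines if line.strip()}
```

With this normalization, a file listing `ENST00000000001.4` and a command line giving `ENST00000000001` select the same transcript.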
2.2.4 - --annotation-default
This parameter specifies a default value for an annotation. The default is used for any annotated variant that would otherwise lack that annotation. If Funcotator itself would add the annotation to a variant, the Funcotator value overwrites the default.
To specify this annotation default, the value on the command line takes the format:
ANNOTATION_FIELD:value
For example, to set the Center annotation to broad.mit.edu:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF \
    --annotation-default Center:broad.mit.edu
It is valid to provide both the --annotation-default and --annotation-override arguments to Funcotator; however, the behavior of specifying an annotation-default and an annotation-override for the same annotation field is undefined.
2.2.5 - --annotation-override
This parameter specifies an override value for an annotation. If the annotation were to be added to a variant by a data source, the value for that annotation would be replaced with the value specified in the annotation override. If the annotation would not be added by a data source it is added to the output with the given value.
To specify this annotation override, the value on the command line takes the format:
ANNOTATION_FIELD:value
For example, to override the NCBI_Build annotation to HG19:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF \
    --annotation-override NCBI_Build:HG19
It is valid to provide both the --annotation-override and --annotation-default arguments to Funcotator; however, the behavior of specifying an annotation-override and an annotation-default for the same annotation field is undefined.
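Since combining a default and an override for the same field is undefined, it may be worth checking for overlaps before launching a run. The sketch below is one possible pre-flight check; `shared_fields` is a hypothetical helper, not a Funcotator feature, and it assumes the ANNOTATION_FIELD:value pairs contain no spaces.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of GATK): print every annotation
# field that appears in both the default list and the override list.
# Usage: shared_fields "DEFAULT_PAIRS" "OVERRIDE_PAIRS"
# Each argument is a space-separated list of FIELD:value pairs
# (assumption: values contain no spaces).
shared_fields() {
    for d in $1; do
        for o in $2; do
            # ${x%%:*} keeps only the text before the first colon (the field name)
            [ "${d%%:*}" = "${o%%:*}" ] && printf '%s\n' "${d%%:*}"
        done
    done
    return 0
}

# No overlap: prints nothing.
shared_fields "Center:broad.mit.edu" "NCBI_Build:HG19"

# NCBI_Build appears in both lists, so it is reported.
shared_fields "Center:broad.mit.edu NCBI_Build:hg19" "NCBI_Build:HG19"
```

An empty result means each field is set by at most one of the two arguments, which avoids the undefined case described above.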
2.2.6 - --allow-hg19-gencode-b37-contig-matching
This flag will cause hg19 contig names to match b37 contig names, allowing a set of variants created on an hg19 reference to match a b37 reference and vice versa.
hg19 was created by UCSC. b37 was created by the Genome Reference Consortium. In practice these references are very similar but have small differences in certain bases, as well as a different naming convention for chromosomal contigs (chr1 in hg19 vs 1 in b37). In 99.9% of cases the results will be identical; however, for certain genomic regions the results will differ.
This flag defaults to true.
To run Funcotator without this hg19/b37 matching:
./gatk Funcotator \
    --variant variants.vcf \
    --reference Homo_sapiens_assembly19.fasta \
    --ref-version hg19 \
    --data-sources-path funcotator_dataSources.v1.2.20180329 \
    --output variants.funcotated.maf \
    --output-file-format MAF \
    --allow-hg19-gencode-b37-contig-matching false
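Before disabling the matching, it can help to know which naming convention a VCF actually uses. The following sketch inspects the first ##contig header line of an uncompressed VCF; `contig_style` is a hypothetical helper and the two demo files are made up for illustration. Note that a chr prefix alone does not identify hg19, since hg38 contigs are also chr-prefixed.

```shell
#!/bin/sh
# Hypothetical helper (not a GATK command): report whether the first
# ##contig header line of an uncompressed VCF uses "chr1"-style or
# "1"-style contig names.
contig_style() {
    first_id=$(sed -n 's/^##contig=<ID=\([^,>]*\).*/\1/p' "$1" | head -n 1)
    case "$first_id" in
        "")   echo "unknown" ;;        # no ##contig lines in the header
        chr*) echo "chr-prefixed" ;;   # hg19-style naming (also used by hg38)
        *)    echo "unprefixed" ;;     # b37-style naming
    esac
}

# Minimal demo headers, written only for this example.
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1,length=249250621>\n' > demo_hg19_style.vcf
printf '##fileformat=VCFv4.2\n##contig=<ID=1,length=249250621>\n' > demo_b37_style.vcf

contig_style demo_hg19_style.vcf
contig_style demo_b37_style.vcf
```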
3 - FAQ
Why do I not get annotations from my favorite data source on my favorite variant?
This almost always happens when the data source does not overlap the variant. Commonly, a variant that is not within a gene will not be annotated by data sources because it is not in the region that the data sources cover (e.g. when the VariantClassification is IGR, FIVE_PRIME_FLANK, COULD_NOT_DETERMINE, etc.).
This can also happen if the given reference file does not match the data sources' reference (for the pre-packaged data sources, either hg19/b37 or hg38). In this case, Funcotator will produce a large obnoxious warning.
4 - Known Issues
The current list of known open issues can be found on the GATK GitHub page.
5 - Github
Funcotator is developed as part of GATK, and its source is available on the GATK GitHub page.
6 - Tool Documentation
Tool documentation is written in the source code for Funcotator to better explain the options for running it and some details of its features.
7 comments
Dear GATK developers,
I have processed a pair of Tumor/normal tissues that are WES'ed. I have followed the whole process of analysis and acquired the annotated .maf result. After going through it, I didn't see a column that's dedicated to the credibility of each variant.
I recall that when I used the VarScan2, it assigned a P value for each variant. It is calculated somehow by the number of reads supporting either the reference or the alternate from both the tumor and normal samples. I wonder if Mutect2 did the same and if I missed it.
Thank you very much.
I believe there is a typo in the README file in the v.1.7 somatic data source package. The "use case" clearly states "somatic", however, the introduction starts with: "This is a collection of data sources to be used in conjunction with Funcotator to annotate Germline data samples."
Thank you for providing the pre-packaged data sources and the downloader tool, saved me a whole bunch of time! :)
I tried to have funcotator annotate some germline variants. Here is my command in mac zshell terminal:
lc % gatk Funcotator --variant ./chr3q.vcf --reference ./reference/GATK/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --ref-version hg38 --data-sources-path ./reference/GATK/funcotator/funcotator_dataSources.v1.7.20200521g --output chr3q_funcotated.maf --output-file-format MAF
However, I ran into following error:
I download both the fasta hg38 and funcotator data source bundle from GATK. I noticed that the contig reference chr1 / 248956422 is from hg38; while the contig features = chr1 / 249250621 actually match GRCH37 chr1 length shown in the following screenshot, which is hg19. I specified in `--ref-version hg38`. What causes this error, and how to fix it?? I want to use hg38 because my variants are called using hg38 ref. Thanks for any help.
Lim Chen
We recommend that people create new posts in the general comments section for support questions.
That said, I'm guessing your VCF is aligned to HG19 / B37. VCF files have header rows in them that specify the reference dictionary used when calling the variants (among other things). Are you sure you called your variants on HG38?
How should I interpret a Gnomad allele frequency that is blank? I note that some have 0 values while others are blank entirely. Does this represent missing data not covered by Gnomad?
Is no longer a valid option in GATK/4.4.
it seems there are some typos in the context, Oncolator and Funcolator should be replaced by Oncotator and Funcotator. Isn't it?