I am new to GATK as well as to the forum.
According to the Funcotator tutorial online (excerpt below), the recommendation is to annotate with both gnomAD data sources (exome and whole genome). It's not clear whether I need to create a single VCF with both annotation sets added or two separate VCFs. Perhaps the most straightforward way to create a single VCF is to choose "VCF" as the output format and then run Funcotator again with that output as the input. My concern is that this may duplicate annotations that are present in both data source sets. For example, the Gencode annotations are present in both. It's not clear whether Funcotator checks for existing annotations so as not to duplicate them.
I appreciate your input.
22.214.171.124 - gnomAD
The pre-packaged data sources include a subset of gnomAD, a large database of known variants. This subset contains a greatly reduced subset of INFO fields, primarily containing allele frequency data. gnomAD is split into two parts - one based on exome data, one based on whole genome data. These two data sources are not equivalent and for complete coverage using gnomAD, we recommend annotating with both. Due to the size of gnomAD, it cannot be included in the data sources package directly. Instead, the configuration data are present and point to a Google bucket in which the gnomAD data reside. This will cause Funcotator to actively connect to that bucket when it is run.
For this reason, gnomAD is disabled by default.
Because Funcotator will query the Internet when gnomAD is enabled, performance will be impacted by the machine's Internet connection speed.
If this degradation is significant, you can localize gnomAD to the machine running Funcotator to improve performance (however due to the size of gnomAD this may be impractical).
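For illustration only, here is a minimal Python sketch of one way both gnomAD sources might be enabled for a single run, so that one output VCF carries both annotation sets. It assumes the pre-packaged data sources directory ships gnomAD as two compressed archives named `gnomAD_exome.tar.gz` and `gnomAD_genome.tar.gz`, and that extracting them in place is what enables them; please check this against your own data sources download. The helper names `enable_gnomad` and `build_funcotator_cmd` are hypothetical, and the sketch only assembles the command line without running it.

```python
import tarfile
from pathlib import Path

# Assumption: these are the archive names inside the data sources directory.
GNOMAD_ARCHIVES = ("gnomAD_exome.tar.gz", "gnomAD_genome.tar.gz")


def enable_gnomad(data_sources_dir):
    """Extract any gnomAD archives found in place; return the names extracted."""
    root = Path(data_sources_dir)
    extracted = []
    for name in GNOMAD_ARCHIVES:
        archive = root / name
        if archive.is_file():
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(root)
            extracted.append(name)
    return extracted


def build_funcotator_cmd(variants, reference, ref_version, data_sources_dir, output):
    """Assemble a single Funcotator invocation that writes VCF output."""
    return [
        "gatk", "Funcotator",
        "--variant", str(variants),
        "--reference", str(reference),
        "--ref-version", ref_version,
        "--data-sources-path", str(data_sources_dir),
        "--output", str(output),
        "--output-file-format", "VCF",
    ]
```

Under these assumptions, a single pass over the input would be enough and there would be no need to feed an already annotated VCF back into Funcotator. The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths are real.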
Can you please provide
a) GATK version used - GenomeAnalysisTk/126.96.36.199