[Repost] Wrong annotation with Funcotator 1.7
We started using GATK v4.1.9.0 Funcotator, using the latest (downloaded using FuncotatorDataSourceDownloader) v1.7.20200521s datasource.
However, when running the annotation on a hg19 mapped vcf file, the annotation of Gencode fields is terribly wrong. For example, completely wrong Gencode_34_variantClassification and Gencode_34_proteinChange fields.
Command used:
Funcotator --variant input.vcf --reference $ref2 --ref-version hg19 -L $capture_targets -ip 100 --data-sources-path $DATA_SOURCES_DIR/funcotator_dataSources.v1.7.20200521s --output output.vcf --output-file-format VCF
When we use the previous v1.6.20190124s datasource, with the same command, annotation is correct and as expected.
For example:
hg19 chr17:7578492 C>T
Funcotator 1.7 output: MISSENSE c.686G>A p.C229Y
Funcotator 1.6 output: NONSENSE c.438G>A p.W146*
I've done some additional tests.
- Also when using the v1.7.20200521g germline datasource the annotation mistakes are there.
- Also when I run the Funcotator command without the -L and -ip options the annotation screwup is there.
- When I first lift my VCF file from hg19 to hg38 using GATK LiftoverVcf and then run Funcotator with the v1.7.20200521g germline datasource and --ref-version hg38, the output is correct and comparable to v1.6 with --ref-version hg19.
- The annotation output seems wrong for certain genes in my capture panel, some genes turn out to be fine.
- For example, variants in HLA-A (chr6) and RB1 (chr13) are correctly annotated with v1.6.20190124g hg19, v1.7.20200521g hg19 and v1.7.20200521g hg38. In contrast, variants in NOTCH1 (chr9) and TP53 (chr17) are correctly annotated with v1.6.20190124g hg19 and v1.7.20200521g hg38, but wrongly annotated when using v1.7.20200521g hg19.
What is going wrong here?
Funcotator 1.7 |
INFO: Failed to detect whether we are running on Google Compute Engine. |
10:05:16.645 INFO Funcotator - ------------------------------------------------------------ |
10:05:16.646 INFO Funcotator - The Genome Analysis Toolkit (GATK) v4.1.9.0 |
10:05:16.646 INFO Funcotator - For support and documentation go to https://software.broadinstitute.org/gatk/ |
10:05:16.646 INFO Funcotator - Executing xxxx on Linux v3.10.0-693.17.1.el7.x86_64 amd64 |
10:05:16.646 INFO Funcotator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14 |
10:05:16.646 INFO Funcotator - Start Date/Time: December 18, 2020 10:05:16 AM CET |
10:05:16.646 INFO Funcotator - ------------------------------------------------------------ |
10:05:16.646 INFO Funcotator - ------------------------------------------------------------ |
10:05:16.647 INFO Funcotator - HTSJDK Version: 2.23.0 |
10:05:16.647 INFO Funcotator - Picard Version: 2.23.3 |
10:05:16.647 INFO Funcotator - HTSJDK Defaults.COMPRESSION_LEVEL : 2 |
10:05:16.647 INFO Funcotator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false |
10:05:16.647 INFO Funcotator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true |
10:05:16.647 INFO Funcotator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false |
10:05:16.647 INFO Funcotator - Deflater: IntelDeflater |
10:05:16.647 INFO Funcotator - Inflater: IntelInflater |
10:05:16.647 INFO Funcotator - GCS max retries/reopens: 20 |
10:05:16.647 INFO Funcotator - Requester pays: disabled |
10:05:16.647 INFO Funcotator - Initializing engine |
10:05:17.145 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/vcf/mutect2/M3-0nM-MK-1_MuTect2_filtered.vcf |
10:05:17.210 INFO FeatureManager - Using codec BEDCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/Manifest/200527_HG19_KNOcustom3_capture_targets.bed |
10:05:17.239 INFO IntervalArgumentCollection - Processing 238722 bp from intervals |
10:05:17.247 INFO Funcotator - Done initializing engine |
10:05:17.247 INFO Funcotator - Validating sequence dictionaries... |
10:05:17.248 INFO Funcotator - Processing user transcripts/defaults/overrides... |
10:05:17.249 INFO Funcotator - Initializing data sources... |
10:05:17.269 INFO DataSourceUtils - Initializing data sources from directory: /home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s |
10:05:17.280 INFO DataSourceUtils - Data sources version: 1.7.2020429s |
10:05:17.280 INFO DataSourceUtils - Data sources source: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/funcotator_dataSources.v1.7.20200429s.tar.gz |
10:05:17.280 INFO DataSourceUtils - Data sources alternate source: gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.7.20200429s.tar.gz |
10:05:17.316 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/achilles_lineage_results.import.txt -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/achilles/hg19/achilles_lineage_results.import.txt |
10:05:17.337 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/cosmic_tissue.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cosmic_tissue/hg19/cosmic_tissue.tsv |
10:05:17.347 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/hgnc_download_Nov302017.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/hgnc/hg19/hgnc_download_Nov302017.tsv |
10:05:17.382 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_20180401.vcf -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar/hg19/clinvar_20180401.vcf |
10:05:17.393 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode_xrefseq_v75_37.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode_xrefseq/hg19/gencode_xrefseq_v75_37.tsv |
10:05:17.408 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_hgmd.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar_hgmd/hg19/clinvar_hgmd.tsv |
10:05:18.615 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode.v34lift37.annotation.REORDERED.gtf -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode/hg19/gencode.v34lift37.annotation.REORDERED.gtf |
10:05:18.615 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode.v34lift37.pc_transcripts.fa -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode/hg19/gencode.v34lift37.pc_transcripts.fa |
10:05:18.632 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/Cosmic.db -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cosmic/hg19/Cosmic.db |
10:05:19.077 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/simple_uniprot_Dec012014.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/simple_uniprot/hg19/simple_uniprot_Dec012014.tsv |
10:05:19.091 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/dnaRepairGenes.20180524T145835.csv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dna_repair_genes/hg19/dnaRepairGenes.20180524T145835.csv |
10:05:19.099 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/CancerGeneCensus_Table_1_full_2012-03-15.txt -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cancer_gene_census/hg19/CancerGeneCensus_Table_1_full_2012-03-15.txt |
10:05:19.136 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/Familial_Cancer_Genes.no_dupes.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/familial/hg19/Familial_Cancer_Genes.no_dupes.tsv |
10:05:19.148 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode_xhgnc_v75_37.hg19.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode_xhgnc/hg19/gencode_xhgnc_v75_37.hg19.tsv |
10:05:19.162 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/oreganno.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/oreganno/hg19/oreganno.tsv |
10:05:19.176 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/cosmic_fusion.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cosmic_fusion/hg19/cosmic_fusion.tsv |
10:05:19.189 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/hg19_All_20180423.vcf.gz -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dbsnp/hg19/hg19_All_20180423.vcf.gz |
10:05:19.189 INFO Funcotator - Finalizing data sources (this step can be long if data sources are cloud-based)... |
10:05:19.191 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/achilles_lineage_results.import.txt -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/achilles/hg19/achilles_lineage_results.import.txt |
10:05:19.214 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/cosmic_tissue.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cosmic_tissue/hg19/cosmic_tissue.tsv |
10:05:19.288 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/hgnc_download_Nov302017.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/hgnc/hg19/hgnc_download_Nov302017.tsv |
10:05:19.444 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_20180401.vcf -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar/hg19/clinvar_20180401.vcf |
10:05:19.444 INFO DataSourceUtils - Setting lookahead cache for data source: ClinVar_VCF : 100000 |
10:05:19.459 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar/hg19/clinvar_20180401.vcf |
10:05:19.688 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_20180401.vcf -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar/hg19/clinvar_20180401.vcf |
10:05:19.782 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar/hg19/clinvar_20180401.vcf |
10:05:19.883 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode_xrefseq_v75_37.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode_xrefseq/hg19/gencode_xrefseq_v75_37.tsv |
10:05:20.015 INFO DataSourceUtils - Setting lookahead cache for data source: ClinVar : 100000 |
10:05:20.021 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_hgmd.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar_hgmd/hg19/clinvar_hgmd.tsv |
10:05:20.028 INFO FeatureManager - Using codec XsvLocatableTableCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar_hgmd/hg19/clinvar_hgmd.config |
10:05:20.164 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_hgmd.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar_hgmd/hg19/clinvar_hgmd.tsv |
10:05:20.164 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/clinvar_hgmd.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/clinvar_hgmd/hg19/clinvar_hgmd.tsv |
WARNING 2020-12-18 10:05:20 AsciiLineReader Creating an indexable source for an AsciiFeatureCodec using a stream that is neither a PositionalBufferedStream nor a BlockCompressedInputStream |
10:05:20.168 INFO DataSourceUtils - Setting lookahead cache for data source: gnomAD_exome : 100000 |
10:05:23.181 INFO FeatureManager - Using codec VCFCodec to read file gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg19/gnomad.exomes.r2.1.sites.INFO_ANNOTATIONS_FIXED.vcf.bgz |
10:05:30.559 INFO FeatureManager - Using codec VCFCodec to read file gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg19/gnomad.exomes.r2.1.sites.INFO_ANNOTATIONS_FIXED.vcf.bgz |
10:05:31.510 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode.v34lift37.annotation.REORDERED.gtf -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode/hg19/gencode.v34lift37.annotation.REORDERED.gtf |
10:05:31.510 INFO DataSourceUtils - Setting lookahead cache for data source: Gencode : 100000 |
10:05:31.525 INFO FeatureManager - Using codec GencodeGtfCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode/hg19/gencode.v34lift37.annotation.REORDERED.gtf |
10:05:31.611 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode.v34lift37.pc_transcripts.fa -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode/hg19/gencode.v34lift37.pc_transcripts.fa |
10:05:37.142 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/Cosmic.db -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cosmic/hg19/Cosmic.db |
10:05:38.046 INFO DataSourceUtils - Setting lookahead cache for data source: gnomAD_genome : 100000 |
10:05:41.491 INFO FeatureManager - Using codec VCFCodec to read file gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg19/gnomad.genomes.r2.1.sites.INFO_ANNOTATIONS_FIXED.vcf.bgz |
10:05:51.224 INFO FeatureManager - Using codec VCFCodec to read file gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg19/gnomad.genomes.r2.1.sites.INFO_ANNOTATIONS_FIXED.vcf.bgz |
10:05:52.556 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/simple_uniprot_Dec012014.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/simple_uniprot/hg19/simple_uniprot_Dec012014.tsv |
10:05:52.683 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/dnaRepairGenes.20180524T145835.csv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dna_repair_genes/hg19/dnaRepairGenes.20180524T145835.csv |
10:05:52.694 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/CancerGeneCensus_Table_1_full_2012-03-15.txt -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cancer_gene_census/hg19/CancerGeneCensus_Table_1_full_2012-03-15.txt |
10:05:52.711 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/Familial_Cancer_Genes.no_dupes.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/familial/hg19/Familial_Cancer_Genes.no_dupes.tsv |
10:05:52.720 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/gencode_xhgnc_v75_37.hg19.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/gencode_xhgnc/hg19/gencode_xhgnc_v75_37.hg19.tsv |
10:05:53.755 INFO DataSourceUtils - Setting lookahead cache for data source: Oreganno : 100000 |
10:05:53.757 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/oreganno.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/oreganno/hg19/oreganno.tsv |
10:05:53.762 INFO FeatureManager - Using codec XsvLocatableTableCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/oreganno/hg19/oreganno.config |
10:05:53.847 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/oreganno.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/oreganno/hg19/oreganno.tsv |
10:05:53.848 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/oreganno.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/oreganno/hg19/oreganno.tsv |
WARNING 2020-12-18 10:05:53 AsciiLineReader Creating an indexable source for an AsciiFeatureCodec using a stream that is neither a PositionalBufferedStream nor a BlockCompressedInputStream |
10:05:53.849 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/cosmic_fusion.tsv -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/cosmic_fusion/hg19/cosmic_fusion.tsv |
10:05:53.869 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/hg19_All_20180423.vcf.gz -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dbsnp/hg19/hg19_All_20180423.vcf.gz |
10:05:53.869 INFO DataSourceUtils - Setting lookahead cache for data source: dbSNP : 100000 |
10:05:53.881 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dbsnp/hg19/hg19_All_20180423.vcf.gz |
10:05:54.015 INFO DataSourceUtils - Resolved data source file path: file:///home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/hg19_All_20180423.vcf.gz -> file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dbsnp/hg19/hg19_All_20180423.vcf.gz |
10:05:54.064 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Test/Pipelines/SeqCap_Pipeline/Necessary_files/GATK/funcotator_dataSources.v1.7.20200521s/dbsnp/hg19/hg19_All_20180423.vcf.gz |
10:05:54.110 INFO Funcotator - Initializing Funcotator Engine... |
10:05:54.138 INFO FuncotatorEngine - Using given VCF and Reference. No conversion required. |
10:05:54.138 INFO Funcotator - Creating a VCF file for output: file:/home/Test/Pipelines/SeqCap_Pipeline/Test/Annotation/funcotator/mutect2/M3-0nM-MK-1_MuTect2_funcotatorS_1.vcf |
10:05:54.189 INFO ProgressMeter - Starting traversal |
10:05:54.189 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute |
10:09:20.968 INFO ProgressMeter - unmapped 3.4 507 147.1 |
10:09:20.969 INFO ProgressMeter - Traversal complete. Processed 507 total variants in 3.4 minutes. |
10:09:20.969 INFO VcfFuncotationFactory - ClinVar_VCF 20180401 cache hits/total: 0/87 |
10:09:20.969 INFO VcfFuncotationFactory - dbSNP 9606_b151 cache hits/total: 0/491 |
10:09:20.969 INFO VcfFuncotationFactory - gnomAD_exome 2.1 cache hits/total: 0/318 |
10:09:20.969 INFO VcfFuncotationFactory - gnomAD_genome 2.1 cache hits/total: 0/455 |
10:09:21.088 INFO Funcotator - Shutting down engine |
[December 18, 2020 10:09:23 AM CET] org.broadinstitute.hellbender.tools.funcotator.Funcotator done. Elapsed time: 4.11 minutes. |
Runtime.totalMemory()=5446828032 |
Tool returned: |
TRUE |
-
Hello A. Brink,
Thank you for the thorough post! We were able to look into your request, here is the report from the Funcotator Developers:
The 1.7 Funcotator Datasource is an update to Gencode 34 from Gencode 19. For hg38, the annotations were used directly and for hg19, a Liftover release of Gencode 34 was used. If there are different annotations in the regions in which your variants occur (including alternatively spliced transcripts), then you may see differences.
We were not able to reproduce the chr17:7578492 C>T variant you reported. Is there any chance there is a copy paste error in this case?
To further look into why a transcript was chosen, you can look into the transcript selection modes. If there is a certain transcript where you want to create primary annotations, you can add those in the transcript-list arguments.
For the variant chr17:7578492 on hg19 and chr17:7675174 on hg38, here are the resources to see the transcripts:
There are some differences expected from Gencode 19 to Gencode 34 and these are not issues in Funcotator. I hope this is a helpful explanation of the differences you are seeing, please let me know if you have other questions.
Genevieve
-
Dear Genevieve,
Thanks for your quick response. There indeed turned out to be a mistake in the pasting of my example, but only in the 1.6 output, not in the (in my opinion) false 1.7 output. This did not have any influence on the strange results we obtain.
Please allow me to repeat my example and its effects.
When I use a VCF file (output from GATK 4.1.9.0 Mutect2) and run Funcotator using the 1.7 datasource I get:
$gatk49 Funcotator --variant input.vcf --reference $ref2 --ref-version hg19 -L chr17:7578492 --data-sources-path $DATA_SOURCES_DIR/funcotator_dataSources.v1.7.20200521s --output output.vcf --output-file-format VCF
Funcotator output (only first part shown):
chr17 7578492 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=978,996|1046,1069;DP=4168;ECNT=1;FUNCOTATION=[TP53|hg19|chr17|7578492|7578492|MISSENSE||SNP|C|C|T|g.chr17:7578492C>T|ENST00000269305.8_4|-|7|628|c.686G>A|c.(685-687)tGt>tAt|p.C229Y
But when I use the same vcf file and run Funcotator using the 1.6 datasource I obtain:
$gatk49 Funcotator --variant input.vcf --reference $ref2 --ref-version hg19 -L chr17:7578492 --data-sources-path $DATA_SOURCES_DIR/funcotator_dataSources.v1.6.20190124s --output output.vcf --output-file-format VCF
Funcotator output (only first part shown):
chr17 7578492 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=978,996|1046,1069;DP=4168;ECNT=1;FUNCOTATION=[TP53|hg19|chr17|7578492|7578492|NONSENSE||SNP|C|C|T|g.chr17:7578492C>T|ENST00000269305.4|-|5|628|c.438G>A|c.(436-438)tgG>tgA|p.W146*
The p.W146 is what I expected, and is also what can be seen in the transcript resource for hg19 you referred to. I can't find the p.C229 in any of the other transcripts, so this does not seem to be the problem?
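For readers who want to compare the two outputs field by field, here is a minimal Python sketch. It assumes the leading FUNCOTATION fields follow the order usually listed in the `##INFO=<ID=FUNCOTATION,...>` header description for a Gencode data source; the `FIELDS` list and the helper names `parse_funcotation` and `diff_funcotations` are illustrative and not part of GATK. In real use the field order should be read from the VCF header rather than hard-coded.

```python
# Field order is an assumption based on the usual FUNCOTATION header
# description; read it from the ##INFO=<ID=FUNCOTATION,...> line in practice.
FIELDS = [
    "hugoSymbol", "ncbiBuild", "chromosome", "start", "end",
    "variantClassification", "secondaryVariantClassification", "variantType",
    "refAllele", "tumorSeqAllele1", "tumorSeqAllele2", "genomeChange",
    "annotationTranscript", "transcriptStrand", "transcriptExon",
    "transcriptPos", "cDnaChange", "codonChange", "proteinChange",
]


def parse_funcotation(prefix):
    """Map the leading pipe-delimited values of one funcotation to field names."""
    values = prefix.lstrip("[").split("|")
    return dict(zip(FIELDS, values))


def diff_funcotations(a, b):
    """Return {field: (value_a, value_b)} for every field whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in FIELDS if a.get(k) != b.get(k)}


# The two truncated FUNCOTATION values quoted in the post above.
v17 = parse_funcotation(
    "[TP53|hg19|chr17|7578492|7578492|MISSENSE||SNP|C|C|T|g.chr17:7578492C>T"
    "|ENST00000269305.8_4|-|7|628|c.686G>A|c.(685-687)tGt>tAt|p.C229Y"
)
v16 = parse_funcotation(
    "[TP53|hg19|chr17|7578492|7578492|NONSENSE||SNP|C|C|T|g.chr17:7578492C>T"
    "|ENST00000269305.4|-|5|628|c.438G>A|c.(436-438)tgG>tgA|p.W146*"
)

differences = diff_funcotations(v17, v16)
for field, (new, old) in differences.items():
    print(f"{field}: 1.7={new}  1.6={old}")
```

The genomic fields agree between the two runs, and the differences are confined to the classification and the transcript-relative fields, which points at the transcript model rather than at the variant call.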
-
Hi A. Brink,
Thank you for your patience while we looked into this. We did find a change in the newest Gencode release that has caused incorrect annotations as you have reported. Thank you for bringing this to our attention!
We created a ticket on github so that we can solve this issue. There is more information about the problem at that link and you can also follow along for a solution. For now, it seems best to stick with the older data release.
Thank you,
Genevieve
-
Hi A. Brink,
Just wanted to update you that our developers have found a fix for the issue and it will be in the next GATK release, which should be within the next couple weeks.
Thank you for helping us find it!
Genevieve
-
Hi, I had a similar problem....
gatk-4.2.0.0, Funcotator v1.7...
Funcotator annotates a mutation in CCND1 as follows:
Hugo_Symbol Entrez_Gene_Id NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Genome_Change Transcript_Strand Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 cDNA_Change Codon_Change Protein_Change dbSNP_ID
CCND1 595 hg38 chr11 69648142 69648142 + 3'UTR g.chr11:69648142G>A + SNP G G A rs9344
The problem is that the same mutation is known, and rs9344 is reported as a mutation in the coding sequence of CCND1
https://www.ncbi.nlm.nih.gov/snp/rs9344?horizontal_tab=true#variant_details
and as a matter of fact the ClinVar annotation considers it a risk factor, as it is.
The problem is that Funcotator uses the Gencode transcripts and .gtf file, and the mutation at position chr11 69648142 is assigned to the transcript ENST00000536559.1, which is an EST with a short CDS, so position 69648142 falls in the 3' UTR.
However, the same mutation is reported for the "real" CCND1 transcript
Other_Transcripts:
CCND1_ENST00000227507.3_Splice_Site_p.P241P
And this polymorphism acts as a crucial risk factor for breast, esophageal, and colorectal cancer but not for cervical cancer.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6265616/
This is a real problem.... the first classification of the mutation is made on a transcript that is an EST, and ENSEMBL states: The sequence shown here is derived from an Ensembl automatic analysis pipeline and should be considered as preliminary data
Since I am interested in mutations in the CDS, I removed this mutation, but when I inspected the BAM file with the IGV software I discovered the problem.
Is there a way to use the RefSeq transcript database and gtf files in Funcotator instead of the Gencode DB?
There are tons of other mutations wrongly assigned to IGR (Inter Genic Region) which instead fall into the CDS of a protein or into an intron.
-
Thanks for posting about these issues you are seeing, we definitely want Funcotator to perform as expected.
In terms of the transcripts that Funcotator is using here, you might want to check out the --transcript-selection-mode argument, where you can change how Funcotator orders and selects the transcripts. CANONICAL is the default; with BEST_EFFECT, however, you can supply a list of transcripts that will be chosen as representatives for each mutation. There is more information about this in the tutorial here.
You can also create your own data sources for Funcotator and add RefSeq as a data source. The section in the tutorial on how to include user-defined data sources is here.
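As a rough illustration of the two arguments mentioned in this thread, the Python sketch below only assembles a Funcotator command line with `--transcript-selection-mode` and one `--transcript-list` entry per transcript; it does not run GATK. The helper name `funcotator_command` and all file paths are hypothetical placeholders, and the transcript ID is the CCND1 transcript quoted earlier in the thread.

```python
import shlex


def funcotator_command(variant, reference, ref_version, data_sources, output,
                       selection_mode="CANONICAL", transcripts=()):
    """Build the argument list for a Funcotator run (nothing is executed here)."""
    args = [
        "gatk", "Funcotator",
        "--variant", variant,
        "--reference", reference,
        "--ref-version", ref_version,
        "--data-sources-path", data_sources,
        "--output", output,
        "--output-file-format", "VCF",
        "--transcript-selection-mode", selection_mode,
    ]
    # One --transcript-list argument per transcript ID to prioritise.
    for transcript_id in transcripts:
        args += ["--transcript-list", transcript_id]
    return args


cmd = funcotator_command(
    variant="input.vcf",                      # placeholder path
    reference="reference.fasta",              # placeholder path
    ref_version="hg38",
    data_sources="funcotator_dataSources",    # placeholder path
    output="output.vcf",                      # placeholder path
    selection_mode="BEST_EFFECT",
    transcripts=["ENST00000227507"],
)
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the arguments before handing the list to something like `subprocess.run`.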
In terms of many other mutations wrongly assigned to IGR, this could be some sort of bug and we would like to look into it further. Could you provide examples of a few variants that fall into that category?
Best,
Genevieve
-
Hi, sorry for the delay in answering. I was unable to use the RefSeq annotation in funcotator.
And the problem with IGR is because it uses the Gencode annotation.
as an example:
hg38 chr1 2059575 2059575 + IGR SNP A A G
This region does not have any transcript from GENCODE, but it is the locus of PRKCZ
http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A2059575%2D2059576&hgsid=276126205_44pQyilsG7B13y8y0eFEynC0NbdX
Moreover, Funcotator correctly reports the SNP rs1878745 at this position: https://www.ncbi.nlm.nih.gov/snp/?term=1878745
So... it is not a bug in Funcotator, but a problem with the genome annotation. Gencode is full of crap: ESTs, transcripts not supported by any cDNA, gene isoforms predicted on the basis of a single EST... There are hundreds of mutations assigned to an intron or to 3' or 5' flanking RNA that actually map to a transcript and fall in the coding sequence. It is really annoying.
All the best,
Stefano
-
Hi Stefano,
I see, thanks for providing the update! Would the option --transcript-list help with ensuring the transcripts you want get picked?
Please let me know if you have any other questions.
Best,
Genevieve
-
Well, not really... I would have preferred all CDS genes from RefSeq, maybe with the longest transcript, but I did not manage to set that up. I guess that you should look at every mutation and, if the gene is of interest, check any reported intron, IGR or RNA mutation, which may actually reside in the CDS of an alternative transcript of the same gene.
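The manual check described above could be partly automated. The Python sketch below flags variants whose primary classification is non-coding while at least one Other_Transcripts entry carries a protein change (a `p.` token). The `NON_CODING` set, the `|` separator and the helper name `coding_alternatives` are assumptions for illustration; the separator and classification labels should be checked against the actual output file.

```python
# Classification labels treated as non-coding; this set is an assumption
# and may need extending for a given output format.
NON_CODING = {
    "IGR", "INTRON", "RNA", "LINCRNA",
    "3'UTR", "5'UTR", "THREE_PRIME_UTR", "FIVE_PRIME_UTR",
    "3'FLANK", "5'FLANK", "THREE_PRIME_FLANK", "FIVE_PRIME_FLANK",
}


def coding_alternatives(primary_classification, other_transcripts, sep="|"):
    """Return Other_Transcripts entries that report a protein change when
    the primary classification is non-coding; otherwise an empty list."""
    if primary_classification.upper() not in NON_CODING:
        return []
    return [entry for entry in other_transcripts.split(sep) if "_p." in entry]


# The CCND1 example from earlier in the thread.
hits = coding_alternatives("3'UTR", "CCND1_ENST00000227507.3_Splice_Site_p.P241P")
print(hits)
```

A variant with a non-empty result is a candidate for manual review rather than for filtering out.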
-
Thanks Stefano for the feedback. If this is a feature request that you would like to see in Funcotator, I would recommend making a post in the General Discussion section with a thorough description of what you want to see. This will help other users find it and let us know if they also would benefit from the feature, as well as help us prioritize a new feature with the development team.