Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

[Repost] Wrong annotation with Funcotator 1.7

Answered

10 comments

  • Genevieve Brandt (she/her)

    Hello A. Brink,

    Thank you for the thorough post! We looked into your request; here is the report from the Funcotator developers:

    The Funcotator 1.7 data source updates Gencode from release 19 to release 34. For hg38, the Gencode 34 annotations were used directly; for hg19, a liftover release of Gencode 34 was used. If the annotations differ in the regions where your variants occur (including alternatively spliced transcripts), you may see differences.

    We were not able to reproduce the chr17:7578492 C>T variant you reported. Is there any chance of a copy-paste error in this case?

    To look further into why a transcript was chosen, see the transcript selection modes. If there are specific transcripts you want used for the primary annotations, you can supply them with the --transcript-list argument.

    For the variant chr17:7578492 on hg19 and chr17:7675174 on hg38, here are the resources to see the transcripts:

    https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtMo[…]A7578492-7578492&hgsid=992958435_VsgRPNbT85MAYV5FasdxFjs2k4IL [hg19]

    https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtMo[…]675174%2D7675174&hgsid=992958435_VsgRPNbT85MAYV5FasdxFjs2k4IL [hg38]

    Some differences are expected between Gencode 19 and Gencode 34, and these are not issues in Funcotator. I hope this explains the differences you are seeing. Please let me know if you have other questions.

    Genevieve

  • A. Brink

    Dear Genevieve,

    Thanks for your quick response. There was indeed a copy-paste mistake in my example, but only in the 1.6 output, not in the (in my opinion incorrect) 1.7 output. It has no bearing on the strange results we obtain.

    Please allow me to repeat my example and its effects. 

    When I use a VCF file (output from GATK 4.1.9.0 Mutect2) and run Funcotator using the 1.7 datasource I get:

    $gatk49 Funcotator --variant input.vcf --reference $ref2 --ref-version hg19 -L chr17:7578492 --data-sources-path $DATA_SOURCES_DIR/funcotator_dataSources.v1.7.20200521s --output output.vcf --output-file-format VCF

    Funcotator output (only first part shown):
    chr17 7578492 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=978,996|1046,1069;DP=4168;ECNT=1;FUNCOTATION=[TP53|hg19|chr17|7578492|7578492|MISSENSE||SNP|C|C|T|g.chr17:7578492C>T|ENST00000269305.8_4|-|7|628|c.686G>A|c.(685-687)tGt>tAt|p.C229Y

    But when I run Funcotator on the same VCF file using the 1.6 data source, I obtain:

    $gatk49 Funcotator --variant input.vcf --reference $ref2 --ref-version hg19 -L chr17:7578492 --data-sources-path $DATA_SOURCES_DIR/funcotator_dataSources.v1.6.20190124s --output output.vcf --output-file-format VCF

    Funcotator output (only first part shown):
    chr17 7578492 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=978,996|1046,1069;DP=4168;ECNT=1;FUNCOTATION=[TP53|hg19|chr17|7578492|7578492|NONSENSE||SNP|C|C|T|g.chr17:7578492C>T|ENST00000269305.4|-|5|628|c.438G>A|c.(436-438)tgG>tgA|p.W146*

    p.W146* is what I expected, and it is also what the hg19 transcript resource you referred to shows. I can't find p.C229Y in any of the other transcripts, so a different transcript choice does not seem to be the explanation?
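
To compare the two records above field by field, the FUNCOTATION value can be split on `|`. The following is only a sketch: the field positions (6 = variant classification, 13 = annotation transcript, 15 = exon, 17 = cDNA change, 19 = protein change) are read off the excerpts quoted above and are an assumption; the authoritative field order is listed in the FUNCOTATION INFO line of the output VCF header. The helper name `funcotation_field` is made up for this illustration.

```shell
# Print the N-th pipe-separated field of a FUNCOTATION value.
# $1 = FUNCOTATION string (with or without the surrounding brackets)
# $2 = 1-based field index
funcotation_field() {
    printf '%s\n' "$1" | sed 's/^\[//; s/\]$//' | cut -d'|' -f"$2"
}

# Excerpts copied from the two runs above (truncated after the protein change).
v17='[TP53|hg19|chr17|7578492|7578492|MISSENSE||SNP|C|C|T|g.chr17:7578492C>T|ENST00000269305.8_4|-|7|628|c.686G>A|c.(685-687)tGt>tAt|p.C229Y'
v16='[TP53|hg19|chr17|7578492|7578492|NONSENSE||SNP|C|C|T|g.chr17:7578492C>T|ENST00000269305.4|-|5|628|c.438G>A|c.(436-438)tgG>tgA|p.W146*'

# Field indices are assumptions taken from the excerpts, not from a header.
for idx in 6 13 15 17 19; do
    printf 'field %s: v1.6=%s  v1.7=%s\n' "$idx" \
        "$(funcotation_field "$v16" "$idx")" \
        "$(funcotation_field "$v17" "$idx")"
done
```

In these excerpts both runs select Ensembl transcript ENST00000269305 (versions .4 and .8_4), yet the exon number, cDNA change, and protein change differ.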

     

     

  • Genevieve Brandt (she/her)

    Hi A. Brink,

    Thank you for your patience while we looked into this. We did find a change in the newest Gencode release that has caused incorrect annotations as you have reported. Thank you for bringing this to our attention!

    We created a ticket on GitHub so that we can solve this issue. The ticket has more information about the problem, and you can follow it for a solution. For now, it seems best to stick with the older data source release.

    Thank you,

    Genevieve

  • Genevieve Brandt (she/her)

    Hi A. Brink,

    Just wanted to let you know that our developers have found a fix for the issue. It will be in the next GATK release, which should be out within the next couple of weeks.

    Thank you for helping us find it!

    Genevieve

  • Stefano Confalonieri

    Hi, I had a similar problem.

    With gatk-4.2.0.0 and Funcotator v1.7, Funcotator annotates a mutation in CCND1 as follows:

    Field                   Value
    Hugo_Symbol             CCND1
    Entrez_Gene_Id          595
    NCBI_Build              hg38
    Chromosome              chr11
    Start_Position          69648142
    End_Position            69648142
    Strand                  +
    Variant_Classification  3'UTR
    Genome_Change           g.chr11:69648142G>A
    Transcript_Strand       +
    Variant_Type            SNP
    Reference_Allele        G
    Tumor_Seq_Allele1       G
    Tumor_Seq_Allele2       A
    cDNA_Change             (empty)
    Codon_Change            (empty)
    Protein_Change          (empty)
    dbSNP_ID                rs9344

    The problem is that this mutation is known: rs9344 is reported as a variant in the coding sequence of CCND1.
    https://www.ncbi.nlm.nih.gov/snp/rs9344?horizontal_tab=true#variant_details

    As a matter of fact, the ClinVar annotation considers it a risk factor, which it is.

    The cause is that Funcotator uses the Gencode transcripts and GTF file, and the mutation at chr11:69648142 is assigned to the transcript ENST00000536559.1, which is an EST with a short CDS, so position 69648142 falls in its 3' UTR.

    However, the same mutation is also reported for the "real" CCND1 transcript:

    Other_Transcripts:
    CCND1_ENST00000227507.3_Splice_Site_p.P241P

    And this polymorphism acts as a crucial risk factor for breast, esophageal, and colorectal cancer but not for cervical cancer.

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6265616/

    This is a real problem: the primary classification of the mutation is made on a transcript that is an EST, and Ensembl states: "The sequence shown here is derived from an Ensembl automatic analysis pipeline and should be considered as preliminary data."

    Since I am interested in mutations in the CDS, I had filtered this mutation out. I only discovered the problem when I inspected the BAM file in IGV.

    Is there a way to use the RefSeq transcript database and GTF files in Funcotator instead of the Gencode database?

    There are many other mutations wrongly assigned to IGR (intergenic region) that actually fall within the CDS of a protein or within an intron.

                                     
                                     
       

     

     
  • Genevieve Brandt (she/her)

    Hi Stefano Confalonieri,

    Thanks for posting about the issues you are seeing. We definitely want Funcotator to perform as expected.

    As for the transcripts that Funcotator is using here, you might want to check out --transcript-selection-mode, which changes how Funcotator orders and selects transcripts. CANONICAL is the default. With BEST_EFFECT, you can also supply a list of transcripts to be chosen as the representatives for each mutation. There is more information about this in the tutorial here.

    You can also create your own data sources for Funcotator and add RefSeq as a data source. The section in the tutorial on how to include user-defined data sources is here.

    In terms of many other mutations wrongly assigned to IGR, this could be some sort of bug and we would like to look into it further. Could you provide examples of a few variants that fall into that category?

    Best,

    Genevieve
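
The transcript options mentioned in this reply could be combined as in the sketch below. It only assembles and prints the command (a dry run) instead of calling GATK. The file names, `REF`, and `DATA_SOURCES_DIR` are placeholders; the transcript ID is the CCND1 transcript quoted earlier in the thread; and passing a single ID directly to `--transcript-list` is an assumption to check against the Funcotator documentation for your GATK version.

```shell
# Hypothetical placeholders: point these at your own reference and data sources.
REF=reference.fasta
DATA_SOURCES_DIR=/path/to/funcotator_dataSources

# Assemble the argument list. Nothing is executed here (dry run).
set -- gatk Funcotator \
    --variant input.vcf \
    --reference "$REF" \
    --ref-version hg38 \
    --data-sources-path "$DATA_SOURCES_DIR" \
    --transcript-selection-mode BEST_EFFECT \
    --transcript-list ENST00000227507.3 \
    --output output.maf \
    --output-file-format MAF

funcotator_cmd="$*"
printf '%s\n' "$funcotator_cmd"
# To run it for real, replace the printf line with: "$@"
```

Whether this produces the annotation you expect still depends on which transcripts the data source contains at the position in question.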

  • Stefano Confalonieri

    Hi, sorry for the delay in answering. I was unable to use the RefSeq annotation in Funcotator.
    The problem with IGR arises because Funcotator uses the Gencode annotation.
    As an example:
    hg38 chr1 2059575 2059575 + IGR SNP A A G

    This region has no transcript from GENCODE but is the locus of PRKCZ:
    http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A2059575%2D2059576&hgsid=276126205_44pQyilsG7B13y8y0eFEynC0NbdX

    Moreover, Funcotator correctly reports a SNP at this position:

    rs1878745
     

    https://www.ncbi.nlm.nih.gov/snp/?term=1878745

    So it is not a bug in Funcotator but a problem with the genome annotation. Gencode is full of crap: ESTs, transcripts not supported by any cDNA, gene isoforms predicted on the basis of a single EST. Hundreds of mutations are assigned to an intron or to 3' or 5' flanking RNA when they actually map to a transcript and fall in the coding sequence. It is really annoying.

    All the best

    Stefano

     

  • Genevieve Brandt (she/her)

    Hi Stefano,

    I see, thanks for providing the update! Would the option --transcript-list help with ensuring the transcripts you want get picked? 

    Please let me know if you have any other questions.

    Best,

    Genevieve

  • Stefano Confalonieri

    Well, not really. I would have preferred all coding (CDS) genes from RefSeq, perhaps with the longest transcript, but I did not manage to set that up. I guess one has to look at every mutation and, if the gene is of interest, explore any reported intron, IGR, or RNA mutation that may actually reside in the CDS of an alternative transcript of the same gene.
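
The manual check described here can be partly automated for MAF output. The sketch below is an assumption-laden filter, not a Funcotator feature: it assumes a tab-separated MAF with the column names `Hugo_Symbol`, `Start_Position`, `Variant_Classification`, and `Other_Transcripts`, and the two label lists (`noncoding`, `coding`) are guesses to adapt to the labels in your own file. The function name `flag_hidden_coding` is hypothetical.

```shell
# Print rows whose primary call is non-coding while Other_Transcripts
# mentions a coding or splice effect. Columns are located by header name.
# $1 = tab-separated MAF file; lines starting with "#" are skipped.
flag_hidden_coding() {
    awk -F '\t' -v OFS='\t' \
        -v noncoding="IGR|Intron|RNA|3'UTR|5'UTR|5'Flank|3'Flank" \
        -v coding="Missense|Nonsense|Splice_Site|Frame_Shift|In_Frame" '
        /^#/ { next }                      # skip comment and version lines
        !have_header {                     # first remaining line: column names
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Variant_Classification" in col) || !("Other_Transcripts" in col)) {
                print "required columns not found" > "/dev/stderr"
                exit 2
            }
            vc = col["Variant_Classification"]; ot = col["Other_Transcripts"]
            have_header = 1
            next
        }
        # Primary label must match exactly; the other-transcript text only partially.
        $vc ~ ("^(" noncoding ")$") && $ot ~ coding {
            print $col["Hugo_Symbol"], $col["Start_Position"], $vc, $ot
        }
    ' "$1"
}

# Usage: flag_hidden_coding annotated.maf
```

Rows like the chr1:2059575 IGR example, where the data source has no overlapping transcript at all, have nothing in `Other_Transcripts` and cannot be recovered this way.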

  • Genevieve Brandt (she/her)

    Thanks Stefano for the feedback. If this is a feature request that you would like to see in Funcotator, I would recommend making a post in the General Discussion section with a thorough description of what you want to see. This will help other users find it and let us know if they also would benefit from the feature, as well as help us prioritize a new feature with the development team.
