Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Funcotator cannot complete funcotaion for variant due to alternate allele

0

7 comments

  • Avatar
    danilovkiri

    Hi Mark Godek

    Actually, the asterisk in the ALT VCF field is OK and it does not throw an exception, all you see are just warnings. Refer to https://samtools.github.io/hts-specs/VCFv4.3.pdf for an explanation of an asterisk in the ALT field (page 8, section 5). 

    As for the actual StringIndexOutOfBoundsException, you can find the most probable reasons for that here (https://gatk.broadinstitute.org/hc/en-us/community/posts/360060301671-Picard-Liftover-hg38-to-hg19-Error-) in the Bhanu Gandham's answer. You have to make sure the FASTA reference genome, your VCF and the --data-sources-path use the same contig names. Check that all VCFs (yours and in data-sources-path) are sorted, re-sort them just in case. 

    Such exceptions are usually easily googled using the exception type with the magic word "gatk".

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    Thanks.

    Yeah, I imagined it had something to do with the data sources. I'm using b37 for everything but the annotation sources in the broad bundle only have hg19 and hg38.

    I was hoping it would be find because --allow-hg19-gencode-b37-contig-matching defaults to true.

     

    Is there a place I can get the b37 data sources or will turning on the online gnomeAD feature help?

     

    Maybe it did the contig matching wrong because I told it the reference was hg19 instead of b37. I just wanted it to run, I'll try again with --ref-version b37.

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Look, hg19 (GRCh37) uses chrN-like chromosome naming consensus, b37 (same build de facto) uses N-like chromosome naming. I have never worked with Funcotator and I'm not sure how it handles such cases with --ref-version option. However, I can say that the best practice is to use one contig naming convention everywhere. So look into your germline.filtered.vcf.gz file and see which chromosome names are used (chr1, 1, NC_000001.10, etc). Look into the contig names in the reference genome fasta file and in --data-sources-path if possible. 

    Eventually, you can use bcftools annotate --rename-chrs to rename chromosome names in VCF files.

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    I tried renaming the chromosomes in the filtered variants before annotating and using an hg19 reference and it's still complaining about contigs.

    I think it annotated fine with the b37 naming, I'm concerned about the StringIndexOutOfBoundsException because it makes the job script exit with a code 250.

    I think the only solution might be to find b37 Annotation_data_sources.

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Can you confirm that Funcotator does produce some sort of output (-O ${OUT}/annotated.germline.vcf)? I mean that the output VCF contains more than just a header. Since your Funcotator log contains the following

    07:33:14.569 WARN GencodeFuncotationFactory - Cannot create complete funcotation for variant at chr12:69756762-69756762 due to alternate allele: *
    07:33:14.575 WARN GencodeFuncotationFactory - Cannot create complete funcotation for variant at chr12:69756763-69756763 due to alternate allele: *
    07:33:14.575 WARN GencodeFuncotationFactory - Cannot create complete funcotation for variant at chr12:69756763-69756763 due to alternate allele: *
    07:33:14.580 WARN GencodeFuncotationFactory - Cannot create complete funcotation for variant at chr12:69756764-69756764 due to alternate allele: *

    I guess you get a part o the input data which is processed. Try to find the exact line in the input VCF file (-V ${OUT}/germline.filtered.vcf.gz) which causes an exception. For that look at the last successfully processed entry in ${OUT}/annotated.germline.vcf and run the analysis starting from that line specifying the -L argument with the region you suppose the error comes from. When you find the entry causing an error, I presume you will be able to find out the source of the exception. Otherwise, post the entry here and we'll look into it.

     

    0
    Comment actions Permalink
  • Avatar
    Mark Godek

    12:70747010 was the last successfully annotated position.

    12:70747693 causes the error.

    12:70753674 would be the next position to be annotated if the exception did not occur.

    12	70747010	.	G	A	84.70	PASS	AC=2;AF=0.200;AN=10;DP=6;ExcessHet=0.2482;FS=0.000;InbreedingCoeff=0.3406;MLEAC=3;MLEAF=0.300;MQ=60.00;POSITIVE_TRAIN_SITE;QD=29.67;SOR=0.693;VQSLOD=19.61;culprit=FS	GT:AD:DP:GQ:PGT:PID:PL:PS	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	0/0:1,0:1:3:.:.:0,3,36	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	0/0:1,0:1:3:.:.:0,3,29	0/0:1,0:1:3:.:.:0,3,33	0/0:1,0:1:3:.:.:0,3,33	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	./.:0,0:0:.:.:.:0,0,0	1|1:0,2:2:6:1|1:70746998_G_A:90,6,0:70746998
    12 70747693 . TAA TA,TAAA,TAAAA,TAAAAA,T 819.41 . AC=6,5,5,6,3;AF=0.200,0.167,0.167,0.200,0.100;AN=30;BaseQRankSum=-6.740e-01;DP=665;ExcessHet=4.7216;FS=23.296;InbreedingCoeff=-0.1901;MLEAC=6,5,4,4,3;MLEAF=0.200,0.167,0.133,0.133,0.100;MQ=59.95;MQRankSum=0.00;QD=11.38;ReadPosRankSum=-8.860e-01;SOR=3.180 GT:AD:DP:GQ:PL 1/4:0,3,0,0,2,0:5:18:104,84,166,86,56,99,57,18,74,81,40,0,57,73,77,102,71,108,86,71,125 1/2:2,3,1,1,0,0:7:8:94,56,78,48,0,59,58,8,66,114,64,18,69,128,159,82,36,77,122,141,145 2/4:1,0,2,0,1,0:5:9:89,65,109,49,57,61,48,87,40,78,9,61,0,44,56,65,109,57,87,61,109 1/3:0,1,0,1,0,0:2:10:31,16,20,37,22,61,20,0,39,35,30,10,52,40,51,37,22,61,39,52,61 1/3:3,6,1,1,0,0:11:7:68,7,84,66,26,142,33,0,124,172,55,19,150,198,273,76,44,165,185,227,227 0/3:3,0,0,2,0,0:5:8:20,21,72,8,54,44,0,60,46,65,21,72,54,60,72,21,72,54,60,72,72 3/5:1,0,0,0,0,3:5:8:70,38,48,73,46,108,52,38,97,113,57,32,103,113,133,8,14,16,0,9,45 0/2:1,0,2,1,0,0:4:2:36,36,71,0,18,9,15,48,2,39,37,76,18,51,84,36,71,18,48,76,71 1/4:1,2,1,0,2,0:6:10:57,35,75,39,20,64,28,10,61,81,14,0,49,78,95,57,43,76,79,69,100 0/4:3,0,0,0,3,2:8:24:64,64,152,43,114,97,24,116,90,105,0,105,74,97,120,31,85,40,35,26,172 2/5:0,1,1,0,0,2:4:16:61,35,49,53,18,72,70,37,79,103,74,45,83,109,131,16,26,0,20,30,67 1/2:0,3,1,0,0,0:4:18:53,27,78,38,0,54,55,18,65,103,65,35,69,115,144,63,32,65,98,112,102 0/4:2,0,1,0,0,0:7:10:70,62,138,54,92,83,10,76,48,60,0,85,38,60,77,62,138,92,76,85,138 0/5:1,0,1,0,0,2:4:3:35,51,96,39,65,64,51,96,65,96,51,96,65,96,96,0,46,3,46,46,65 3/4:1,0,0,2,1,0:4:2:27,25,31,21,28,26,2,11,6,7,14,21,21,0,19,25,31,28,11,21,31
    12 70753674 . G T 50.91 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.3492;MLEAC=4;MLEAF=1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=25.46;SOR=0.693;VQSLOD=20.52;culprit=FS GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 1/1:0,2:2:6:49,6,0 ./.:0,0:0:.:0,0,0

    Since the Funcotator worked for chromosomes 1-12 up til position 70747693, I tried running the Funcotator -L with an intervals file of chromosomes 13 through Y and it completed fine. What I may end up doing is using -XL 12:70747693-70753673 to exclude those positions in chromosome 12.

    Do you know why that site couldn't be annotated?

    Unrelated but when I look into the log file, I also got some errors (example below) about variant overlaps. Is this normal or is there a way to correct it? Thanks again.

    01:20:34.783 ERROR GencodeFuncotationFactory - Problem creating a GencodeFuncotation on transcript ENST00000377067.3 for variant: chr13:92050910-92050934(CGGAGGCGGCGGCGGCGGCGGCAGT* -> C): Variant overlaps transcript but is not completely contained within it. Funcotator cannot currently handle this case. Transcript: ENST00000377067.3  Variant: [VC Unknown @ chr13:92050910-92050934 Q738.93 of type=INDEL alleles=[CGGAGGCGGCGGCGGCGGCGGCAGT*, *, C] attr={AC=[4, 1], AF=[0.133, 0.033], AN=30, BaseQRankSum=2.74, DP=269, ExcessHet=0.8219, FS=2.617, InbreedingCoeff=0.2396, MLEAC=[4, 1], MLEAF=[0.133, 0.033], MQ=60.00, MQRankSum=0.00, QD=15.08, ReadPosRankSum=0.499, SOR=0.368} GT=GT:AD:DP:GQ:PL	0/0:14,0,0:14:7:0,7,467,7,467,467	0/0:30,0,0:30:72:0,72,1080,72,1080,1080	0/0:19,0,0:19:45:0,45,675,45,675,675	0/0:13,0,0:13:39:0,39,510,39,510,510	0/0:41,0,0:41:99:0,111,1665,111,1665,1665	0/1:9,2,0:11:57:57,0,371,84,377,461	0/1:11,7,0:18:99:262,0,405,295,426,721	0/0:17,0,0:17:45:0,45,675,45,675,675	0/2:8,0,2:10:60:60,84,405,0,321,315	0/0:14,0,0:14:7:0,7,371,7,371,371	0/0:14,0,0:14:39:0,39,585,39,585,585	0/0:35,0,0:35:99:0,99,1485,99,1485,1485	0/0:9,0,0:9:27:0,27,331,27,331,331	1/1:0,10,0:10:31:451,31,0,451,31,451	0/0:10,0,0:10:30:0,30,368,30,368,368 filters=

    Edit: yeah I was able to get the full exome annotated with the -XL 12:70747693-70753673 option.

    Thanks again.

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi Mark Godek and danilovkiri

     

    The developer of Funcotator is very interested in this issue and is also looking into it. I have created a github issue ticket for it and you can follow its progress here: https://github.com/broadinstitute/gatk/issues/6651

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk