Funcotator errors
Hello,
I am trying to use the Funcotator workflow in Terra
https://dockstore.org/workflows/github.com/broadinstitute/gatk/Funcotator:4.1.7.0
To annotate a VCF that was produced by the Mutect2_pon workflow
https://dockstore.org/workflows/github.com/broadinstitute/gatk/mutect2_pon:4.1.6.0?tab=info
GATK docker that I used for both analyses is: "us.gcr.io/broad-gatk/gatk:4.1.7.0"
Funcotator (first linked workflow) gives me two types of errors:
1) A lot of these:
GencodeFuncotationFactory - Cannot create complete funcotation for variant at chr4:9249899-9249899 due to alternate allele: *
2) And then also:
***********************************************************************
A USER ERROR has occurred: Unknown file is malformed: File contains a bad codon sequence that has no amino acid equivalent: CNN
***********************************************************************
Workspace is called 661-Clonal hematopoiesis and your team already has access to it.
The job ID is 3aef30a2-8caf-4bef-8b5d-8574c16a096a and you should be able to access all of the other relevant details that you may need to troubleshoot. The workflow is called Funcotator_copy (and is just a copy pasted WDL of the Funcotator workflow linked above, w/o any changes made whatsoever).
Could you please first check that all of my input variables look OK and that the genome versions are matching (I believe they should be). Then if the input looks OK, I would appreciate if you could please let me know what to try next.
Many thanks,
Mia
-
Hi Mia,
Thanks for writing in. The error message suggests that the BAM or SAM is malformed—can you confirm that your file is valid using ValidateSamFile?
https://gatk.broadinstitute.org/hc/en-us/articles/360042478272-ValidateSamFile-Picard-
Many thanks,
Jason
-
Hi Jason,
Thanks. I am not sure I understand, because I am not using BAM or SAM files in this workflow at all. My input file is a VCF that was produced by Mutect2. You can see how the analysis was set up from the details in my post above.
Please let me know and many thanks,
Mia
-
Hi Mia,
Sorry for the confusion there—in this case you may want to validate the VCF to ensure there isn't anything malformed about it: https://gatk.broadinstitute.org/hc/en-us/articles/360042914291-ValidateVariants
I also see in the log that there's this warning:
21:50:43.053 WARN FuncotatorEngine - WARNING: You are using B37 as a reference. Funcotator will convert your variants to GRCh37, and this will be fine in the vast majority of cases. There MAY be some errors (e.g. in the Y chromosome, but possibly in other places as well) due to changes between the two references.
I can't say for certain whether this is the underlying issue, but ValidateVariants should be fairly revealing in what's wrong with the VCF, if anything.
Kind regards,
Jason
-
Thanks Jason:
1) The VCF I am trying to annotate was generated using the Mutect2_pon workflow. The BAM files were aligned to the hg19 reference, but the ref_fai, ref_fasta and ref_dict arguments I specified used the b37 reference. Can you please check with the developers whether this may be causing the issue now that I am trying to Funcotate the resulting PON VCF?
2) Would it please be possible to link me to public files for the ref_fai, ref_fasta and ref_dict arguments (required for all sorts of GATK workflows) for both the hg19 and GRCh37 references?
3) Would it please be possible to let me know, or point me to documentation on, how compatible the GRCh37, hg19 and b37 references are across the GATK pipelines, in terms of both expected results and running one against another? For example:
3a) can I use PON from b37 in Mutect2 analysis using bams in GRCh37?
3b) If bam files are mapped to one reference (GRCh37, hg19 or b37) do ref_fai, ref_fasta and ref_dict arguments used in Mutect have to match the reference used in mapping or any from the three reference types can be used? I understand that reference needs to be matching between GRCh37 and GRCh38, but not sure what is the case for different versions of GRCh37 (i.e. hg19 and b37)
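A quick way to see whether a VCF and a reference at least agree on contig naming (b37-style names such as 4 versus hg19-style names such as chr4) is to compare the first column of the VCF records against the first column of the reference .fai index. The sketch below is only an illustration: the function name contigs_missing_from_ref and the file names in the usage note are hypothetical, it assumes an uncompressed VCF, and it checks names only, not contig lengths or sequence content.

```shell
# List contig names that appear in a VCF but not in a reference .fai index.
# usage: contigs_missing_from_ref input.vcf reference.fasta.fai
# Assumption: both files are plain text and tab-delimited.
contigs_missing_from_ref() {
    awk -F'\t' '
        NR == FNR { ref[$1] = 1; next }   # first file (.fai): column 1 is the contig name
        /^#/ { next }                     # second file (VCF): skip header lines
        !($1 in ref) && !($1 in reported) {
            reported[$1] = 1              # report each missing contig only once
            print $1
        }
    ' "$2" "$1"
}
```

For example, `contigs_missing_from_ref my_pon.vcf Homo_sapiens_assembly19.fasta.fai` prints nothing when every contig named in the VCF exists in the index. Any names it does print point to a naming mismatch between the two files.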
Thanks
Mia
-
Hi Mia,
2) You can find the resources in the GATK Resource Bundle. The files of interest are found in the Broad-owned bucket gs://gcp-public-data--broad-references. You can access these files in the console by going here: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references
You can find the exact path for the file by examining the file within the workspace (such as in one of the featured workspaces).
For your other questions, can you take a look at GATK's Human genome reference builds documentation, particularly the section titled Legacy assemblies, and let me know if/which questions remain?
Many thanks,
Jason
-
Hi Jason,
Thanks for looking into this:
1) The link you gave me refers to three references as 'b37/GRCh37 and hg19', and suggests that b37 and GRCh37 are different from hg19, but does not talk about possible differences between b37 and GRCh37. However, the Funcotator error noted:
You are using B37 as a reference. Funcotator will convert your variants to GRCh37, and this will be fine in the vast majority of cases. There MAY be some errors (e.g. in the Y chromosome, but possibly in other places as well) due to changes between the two references.
Thus b37 and GRCh37 appear to be different (as was also my understanding) and this link does not explain whether they are different, nor how interchangeable these are in GATK resources and workflows.
Would it please be possible to link me to a resource that answers those two exact questions, and if not, I would appreciate if you or somebody from your team can please comment.
2) If b37 is indeed different from GRCh37, can you please link me to GRCh37 resources, as the ones you provided are only for hg19/b37 ? I am specifically after GRCh37 files for ref_fai, ref_fasta and ref_dict arguments.
3) Thanks for providing the link to bundle. I see that 'hg19' folder contains files termed 'b37', which is further confusing because the document you linked says that hg19 and b37 are different. Can you please clarify with the team? If they are different, can I please be linked to hg19 resources (the same ones as above)?
4) Can somebody from the team advise whether Funcotator would fail at all because of the error you noted with regards to the references? It is possible that we are going down a completely wrong road here...
5) I would love to validate the VCF using the tool you suggested, but the documentation says nothing about how to install that tool. In other words, typing gatk into terminal, even after calling the gatk dotkit, returns
gatk: command not found
I should also note again, that the VCF that I am feeding into Funcotator is a direct and unmodified version of the VCF produced by Mutect2 - so if this was a formatting issue, it would again be a question for the GATK team. Either way, I would appreciate if somebody can look at this more closely and look at the actual files, that are all available as part of the referenced workflow and jobID in my first post.
Thanks
Mia
-
Hi Mia,
Let's try to get the ValidateVariants tool running first to see if this is an angle worth tackling, and then revisit the rest of the questions where needed. I'll be happy to see if we can get you some assistance from the GATK team, or others who are familiar with the workflow's functionality, once we confirm whether the vcf is valid so that they don't need to be worried about that part playing a role.
The ValidateVariants tool comes with GATK, which you can download using the button at the top-right of the website.
As is the case with any of these tools, you can use them by downloading GATK4. You can find more information about how to install and use GATK in the Getting Started with GATK4 article found in the Getting Started section of the User Guide.
The error you are seeing on the Broad server is likely an issue with the dotkit being malformed, not with GATK in general. You would have to write in to BITS to get it fixed. It's also quite old so I would recommend inquiring about creation of a dotkit with a more recent GATK version (server has 4.0.4 and latest is 4.1.8.1).
If you're interested in that route, you can request a dotkit by going here: https://broad.service-now.com/sp?id=sc_cat_item&sys_id=c48a528bdb1da3400f1b6033ca96190d
Let us know how it goes!
Kind regards,
Jason
-
Hi Jason,
I ran the Validate Variants tool as follows:
./gatk ValidateVariants -R /Users/mpetljak/Desktop/Mutect_input_Homo_sapiens_assembly19.fasta -V /Users/mpetljak/Desktop/37be10ca-d8da-4d2c-898f-5f9b108bae96_Mutect2_Panel_092bf821-bc47-48d9-9561-d3c3fbf874d6_call-MergeVCFs_GTEX_WES_GRCh37_Under40.vcf -dbsnp /Users/mpetljak/Desktop/hg19_v0_Homo_sapiens_assembly19.dbsnp.vcf
The output is copied below, I am not sure what it means because the documentation for the tool only says how to run it, but it contains no information on expected outputs/outcomes. Can you please let me know what are the next steps to get funcotator working?
/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
Aug 17, 2020 4:45:32 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:45:32.559 INFO ValidateVariants - ------------------------------------------------------------
16:45:32.559 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.8.1
16:45:32.559 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
16:45:32.559 INFO ValidateVariants - Executing as mpetljak@wma06-4df on Mac OS X v10.14.6 x86_64
16:45:32.559 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v14.0.2+12-46
16:45:32.559 INFO ValidateVariants - Start Date/Time: August 17, 2020 at 4:45:32 PM EDT
16:45:32.559 INFO ValidateVariants - ------------------------------------------------------------
16:45:32.559 INFO ValidateVariants - ------------------------------------------------------------
16:45:32.560 INFO ValidateVariants - HTSJDK Version: 2.23.0
16:45:32.560 INFO ValidateVariants - Picard Version: 2.22.8
16:45:32.560 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:45:32.560 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:45:32.560 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:45:32.560 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:45:32.560 INFO ValidateVariants - Deflater: IntelDeflater
16:45:32.560 INFO ValidateVariants - Inflater: IntelInflater
16:45:32.560 INFO ValidateVariants - GCS max retries/reopens: 20
16:45:32.560 INFO ValidateVariants - Requester pays: disabled
16:45:32.560 INFO ValidateVariants - Initializing engine
16:45:32.729 INFO FeatureManager - Using codec VCFCodec to read file file:///Users/mpetljak/Desktop/hg19_v0_Homo_sapiens_assembly19.dbsnp.vcf
16:45:32.843 INFO FeatureManager - Using codec VCFCodec to read file file:///Users/mpetljak/Desktop/37be10ca-d8da-4d2c-898f-5f9b108bae96_Mutect2_Panel_092bf821-bc47-48d9-9561-d3c3fbf874d6_call-MergeVCFs_GTEX_WES_GRCh37_Under40.vcf
16:45:32.855 INFO ValidateVariants - Done initializing engine
16:45:32.855 INFO ProgressMeter - Starting traversal
16:45:32.856 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
16:45:43.183 INFO ProgressMeter - 1:214394669 0.2 46000 267312.3
16:45:53.502 INFO ProgressMeter - 2:153871947 0.3 85000 247021.2
16:46:03.707 INFO ProgressMeter - 3:106520887 0.5 114000 221710.8
16:46:13.914 INFO ProgressMeter - 4:106751028 0.7 149000 217740.8
16:46:24.193 INFO ProgressMeter - 5:97133431 0.9 178000 208041.1
16:46:34.196 INFO ProgressMeter - 6:57560717 1.0 217000 212259.5
16:46:44.555 INFO ProgressMeter - 7:90417820 1.2 260000 217576.3
16:46:54.862 INFO ProgressMeter - 8:134230179 1.4 298000 218032.8
16:47:04.902 INFO ProgressMeter - 10:42364888 1.5 341000 222282.6
16:47:14.938 INFO ProgressMeter - 11:71932130 1.7 379000 222766.5
16:47:25.235 INFO ProgressMeter - 12:85121928 1.9 403000 215166.7
16:47:35.296 INFO ProgressMeter - 14:55129132 2.0 433000 212185.6
16:47:45.443 INFO ProgressMeter - 16:73825612 2.2 491000 222193.7
16:47:55.540 INFO ProgressMeter - 19:19682119 2.4 547000 230038.1
16:48:05.743 INFO ProgressMeter - X:13879124 2.5 621000 243711.0
16:48:10.581 INFO ProgressMeter - Y:59013982 2.6 653885 248743.7
16:48:10.581 INFO ProgressMeter - Traversal complete. Processed 653885 total variants in 2.6 minutes.
16:48:10.582 INFO ValidateVariants - Shutting down engine
[August 17, 2020 at 4:48:10 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 2.64 minutes.
Runtime.totalMemory()=1028653056
-
Hello,
An update: I changed the pre-packaged Funcotator data source version from funcotator_dataSources.v1.6.20190124s.tar.gz to the latest version funcotator_dataSources.v1.7.20200521s.tar.gz.
This took the workflow further down the line, but I am now getting an error that suggest there may be something wrong with one of the GATK source files: file:///cromwell_root/datasources_dir/gencode/hg19/gencode.v34lift37.annotation.REORDERED.gtf
The relevant part of the log is copied below and the full version can be accessed here:
Can you please let me know what are next steps?
Thanks,
Mia
htsjdk.tribble.TribbleException$MalformedFeatureFile: Error parsing line: LineIteratorImpl(SynchronousLineReader), for input source: file:///cromwell_root/datasources_dir/gencode/hg19/gencode.v34lift37.annotation.REORDERED.gtf
    at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.readNextRecord(TribbleIndexedFeatureReader.java:510)
    at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.<init>(TribbleIndexedFeatureReader.java:426)
    at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:297)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:567)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.queryAndPrefetch(FeatureDataSource.java:536)
    at org.broadinstitute.hellbender.engine.FeatureManager.getFeatures(FeatureManager.java:353)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:173)
    at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.queryFeaturesFromFeatureContext(DataSourceFuncotationFactory.java:304)
    at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.getFeaturesFromFeatureContext(DataSourceFuncotationFactory.java:219)
    at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.createFuncotations(DataSourceFuncotationFactory.java:197)
    at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.createFuncotations(DataSourceFuncotationFactory.java:172)
    at org.broadinstitute.hellbender.tools.funcotator.FuncotatorEngine.lambda$createFuncotationMapForVariant$0(FuncotatorEngine.java:147)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.broadinstitute.hellbender.tools.funcotator.FuncotatorEngine.createFuncotationMapForVariant(FuncotatorEngine.java:157)
    at org.broadinstitute.hellbender.tools.funcotator.Funcotator.enqueueAndHandleVariant(Funcotator.java:903)
    at org.broadinstitute.hellbender.tools.funcotator.Funcotator.apply(Funcotator.java:857)
    at org.broadinstitute.hellbender.engine.VariantWalker.lambda$traverse$0(VariantWalker.java:104)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)
Caused by: java.lang.NumberFormatException: For input string: "chr1:+:11869-12227"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.valueOf(Long.java:803)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature.<init>(GencodeGtfFeature.java:224)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfExonFeature.<init>(GencodeGtfExonFeature.java:19)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfExonFeature.create(GencodeGtfExonFeature.java:23)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature$FeatureType$4.create(GencodeGtfFeature.java:777)
    at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature.create(GencodeGtfFeature.java:320)
    at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:138)
    at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:23)
    at htsjdk.tribble.TribbleIndexedFeatureReader$QueryIterator.readNextRecord(TribbleIndexedFeatureReader.java:486)
    ... 43 more
-
Hi Mia,
I'm glad to hear there's been some progress with the ValidateVariants tool. Can you share the file with jcerrato@broadinstitute.org so I can examine it in-depth? I'm currently getting an error trying to access it.
Can you confirm that file:///cromwell_root/datasources_dir/gencode/hg19/gencode.v34lift37.annotation.REORDERED.gtf came from the funcotator data sources tar.gz? If it's from somewhere else, please let me know!
It's possible that the tool isn't playing well with our own files or the standard genomic reference files. I'm happy to get some more experienced GATK help here!
Kind regards,
Jason
-
Hi Jason,
The file flagged in the log that might be the issue is not the one that I specified, and is either part of the GATK's embedded Funcotator workflow or part of the Funcotator sources provided by the GATK. To see if the latter is the case, you can see whether this file exists in (gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.7.20200521s.tar.gz ) that I used; but either way if you determine from the log that this particular file is an issue, it may be a good time to bring in the GATK team because whatever is the source of the file, it is not user-specified.
Can you along the way please let me know whether you were happy with the VCF validation from above - does the output suggest that the VCF is OK and how do you determine that?
Thanks,
Mia
-
Hi Mia,
A couple of my colleagues have verified that your output for ValidateVariants suggests there aren't any issues—if there were, you would get warnings/errors.
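As a rough illustration of that check, a saved GATK log can be scanned for WARN or ERROR lines and for the "A USER ERROR has occurred" banner quoted earlier in this thread. This is a minimal sketch, not an official procedure: the function name log_problems and the log file name in the usage note are hypothetical, and it assumes the log uses the "time LEVEL Tool - message" layout shown in the ValidateVariants output above.

```shell
# Print log lines that carry a WARN or ERROR level, or the GATK user-error banner.
# usage: log_problems validate_variants.log
# grep exits non-zero when nothing matches, which here means "no problems found".
log_problems() {
    grep -E '(^|[[:space:]])(WARN|ERROR)[[:space:]]|A USER ERROR has occurred' "$1"
}
```

Running `log_problems validate_variants.log` on a clean log prints nothing. Any line it prints is worth reading before concluding the VCF is fine.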
I'm reaching out to an in-house GATK expert to more closely examine the log you've provided.
Kind regards,
Jason
-
Hi Mia,
They've informed me that this same issue has been brought up recently. The conclusion is that GATK is not ready for the v1.7 data sources. There is currently an incompatibility that will be updated shortly. For now they recommend using v1.6.20190124.
Read more about that other reported issue here: https://gatk.broadinstitute.org/hc/en-us/community/posts/360072132411-Funcotator-datasources-v1-7-gencode-raise-error
However, I remember from your original inquiry that you reported your issue when using v1.6.20190124. Is it accurate to say that this would still be an issue at present, or would you need to rerun the job with any changes you've made (if any) to confirm? If there is still an issue, please let me know and I'll bring it up with a GATK expert right away.
P.S. I noticed in your original inquiry that you used Mutect2 version 4.1.6.0 to produce the VCF you are now trying to use in Funcotator. If you do need to rerun a job, can you also try with the 4.1.6.0 version of Funcotator and GATK to see if keeping things consistent with versions helps resolve some issues?
Let me know.
Many thanks,
Jason
-
Hi Jason,
Thanks for following up. Correct, the original issue was raised re. failures I was getting when running with the v1.6.20190124 Funcotator sources.
I could try the 4.1.6.0 version of Funcotator, to match the version of Mutect used to produce the PON that I am trying to Funcotate, but I actually used the 4.1.7.0 docker with Mutect 4.1.6.0, because the 4.1.6.0 docker was bugged and gave me issues before (https://gatk.broadinstitute.org/hc/en-us/community/posts/360060174372-Haplotype-Caller-4-1-6-0-java-lang-IllegalStateException-Smith-Waterman-alignment-failure-). We discussed this on one of the separate threads. So I could not get everything down to 4.1.6.0 (because the docker is bugged), nor up to the .7 version (because the sources are not ready, as you explained).
It would be great if you can please ask somebody from the team to look at my original enquiry, relevant information summarized below:
1) I created a PON using Mutect 4.1.6.0 with the 4.1.7.0 docker, because the 4.1.6.0 docker was bugged
2) I am trying to annotate the PON VCF with Funcotator. I am feeding the VCF from step 1 into the analysis directly, w/o any modifications to the VCF. The VCF was checked with the ValidateVariants tool. Funcotator info:
Funcotator version: 4.1.7.0
GATK docker: 4.1.7.0
Funcotator sources: funcotator_dataSources.v1.6.20190124s.tar.gz
3) I am sending you logs of two failed attempts via email, as well as the input json file.
4) For other information workspace is called 661-Clonal hematopoiesis and your team already has access to it. The job ID is 3aef30a2-8caf-4bef-8b5d-8574c16a096a and you should be able to access all of the other relevant details that you may need to troubleshoot.
5) I suspect the issue is to do with source files, because when I tried a later version (.7) the workflow got further down the line, but then failed again (presumably because as you explained, this version of sources is not ready yet)
Thanks,
Mia
-
Hi Mia,
Just as an update, we're looking to get advice from the author of the Funcotator workflow on this error.
Kind regards,
Jason
-
Hi Mia,
Here's what I've heard from the author:
Funcotator does not produce full funcotations for spanning deletions, so any `*` alt alleles will generate the message:
GencodeFuncotationFactory - Cannot create complete funcotation for variant at chr4:9249899-9249899 due to alternate allele: *
For these alleles, some annotations will be produced, but a "complete" funcotation will not be.
The real error here is this second message. Funcotator expects that all Alt alleles will be standard bases (ATGC) and assumes that the genome sequence around all variants will also be standard bases (ATGC).
One of two things has happened.
- M2 has somehow produced a variant that includes N bases. (unlikely)
- The region in the reference to which a variant in their file mapped includes N bases. (more likely)
Can you search your VCF for Alt alleles containing N bases? If any of these exist, then this is the problem. Let us know if this is the case.
If not, then this is a bug in how Funcotator handles N bases in the predicted protein sequence and the author would like to know which variant is causing this problem. You should be able to look at the last annotated variant in your output. Please give us the coordinates of the next variant in your input file after the last annotated variant.
Kind regards,
Jason
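Both requests above can be sketched with awk on an uncompressed VCF. This is an illustration under stated assumptions rather than a GATK-provided procedure: the function names find_n_alts and next_variant_after and the file names in the usage note are hypothetical, and the input is assumed to be a plain tab-delimited VCF (a .vcf.gz would need decompressing first). Note that the log in this thread reports chr4 while the input VCF may name the same contig 4, so the contig name passed to the second function has to match the file being searched.

```shell
# Print CHROM, POS, REF, ALT for records with an N base in any ALT allele.
# Symbolic alleles (e.g. <NON_REF>) and the spanning-deletion allele * are skipped.
# usage: find_n_alts input.vcf
find_n_alts() {
    awk -F'\t' '
        /^#/ { next }                     # skip header lines
        {
            n = split($5, alts, ",")      # ALT can hold several comma-separated alleles
            for (i = 1; i <= n; i++) {
                if (alts[i] ~ /^</ || alts[i] == "*") continue
                if (alts[i] ~ /[Nn]/) {
                    print $1 "\t" $2 "\t" $4 "\t" $5
                    break
                }
            }
        }
    ' "$1"
}

# Print the first record that follows the given CHROM and POS in file order.
# usage: next_variant_after CHROM POS input.vcf
next_variant_after() {
    awk -F'\t' -v chrom="$1" -v pos="$2" '
        /^#/ { next }
        seen && !($1 == chrom && $2 == pos) { print; exit }
        $1 == chrom && $2 == pos { seen = 1 }   # remember that the anchor record was passed
    ' "$3"
}
```

For example, `find_n_alts my_pon.vcf` prints nothing if no ALT allele contains N, and `next_variant_after 4 9249899 my_pon.vcf` prints the record that comes right after the one at 4:9249899.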
-
Hi Jason,
Thanks for addressing this with the team.
I checked and there are no 'N' variants among the wild-type bases in the VCF.
In terms of the next variant in the input, following the last annotated one:
#CHROM POS ID REF ALT QUAL FILTER INFO
4 9274640 . A ATCACTG,ATCCTG . . BETA=0.989,0.141;FRACTION=0.022
Thanks,
Mia
P.S. if this does not resolve it, the Funcotator outputs from that failed job are here:
I am hoping you might be able to open with the access we gave you before
-
Hi Mia,
Thanks for that. I've passed the details on to the author and I can confirm I am able to access that link.
I'll let you know once I hear back.
Kind regards,
Jason
-
Thanks Jason. I'd really appreciate every effort to help me resolve this by the end of this week as we have not been able to move these analyses anywhere forward for 3 weeks now (since the issue was first raised).
Best wishes,
Mia
-
Hi Mia,
I'm still awaiting word from the author. I sent an update request yesterday—I'll send another today if I don't hear by 11AM.
Kind regards,
Jason
-
Hi Mia,
I've heard back from the author. They were able to reproduce the error with the variant you provided. They're digging deeper into the issue, but they think it's looking like a bug.
In the meantime, if you can remove that variant from the file, Funcotator should run correctly on the rest of the file (assuming no other variants have the same issue). If you find that any other variants are affected, please let me know and I'll flag them up with the author ASAP.
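Removing a single record can be sketched with awk, keeping the header and every other record untouched. This is only one possible approach and not something the author prescribed: the function name drop_variant and the file names in the usage note are hypothetical, and the input is assumed to be an uncompressed, tab-delimited VCF. The edited file needs a fresh index afterwards.

```shell
# Write a VCF to stdout with every record at CHROM:POS removed; headers are kept.
# usage: drop_variant CHROM POS input.vcf > output.vcf
drop_variant() {
    awk -F'\t' -v chrom="$1" -v pos="$2" '
        /^#/ { print; next }                   # always keep header lines
        $1 == chrom && $2 == pos { next }      # drop the record(s) at the given position
        { print }
    ' "$3"
}
```

For the variant reported above this would be `drop_variant 4 9274640 my_pon.vcf > my_pon.filtered.vcf`.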
Kind regards,
Jason
-
Thanks Jason, I will try that.
Given that I need to modify the vcf and create a new index file, can you please let me know :
1. is the workflow going to accept .vcf.gz.tbi index format ?
2. if not, how would I generate .idx index file that is output of Mutect2 ?
I tried this, but it is giving me a .gz.tbi file (https://gatk.broadinstitute.org/hc/en-us/articles/360036899892-IndexFeatureFile)
3. if yes, is .vcf.gz.tbi compatible with .vcf, or should the vcf be vcf.gz ?
Thanks,
Mia
-
Hi Mia,
Here's what I've heard back from the author.
Short answer to all 3: It should accept a vcf.gz.tbi, and everything should "just work".
With respect to #3 - Yes - I think it should be g-zipped. If this doesn't work I can try to put in the fix today / get it in Monday.
Let us know how it goes for you and I'll let them know if you run into any trouble.
Kind regards,
Jason
-
Hi Jason,
I was rerunning the same workflow with the new vcf and new vcf.idx and now I get the following error:
The job ran further than the previous variant, but I still get the 'incompletely' funcotated VCF.
I do not think that the error is to do with a specific variant, as it was last time, because there is no such error in .err file, nor in the worklog.
Submission ID: 753d047a-e91a-4134-a200-b753cfa2bd0a
Workspace ID: a46c7502-d26e-4217-b1d4-7d80a20d7456
Worklog: gs://fc-secure-a46c7502-d26e-4217-b1d4-7d80a20d7456/753d047a-e91a-4134-a200-b753cfa2bd0a/Funcotator/f0a0794a-eaf5-4b10-884d-5604b1aadda0/call-Funcotate/Funcotate.log
I see the same error reported here:
And the suggestion that worked was to add continueOnReturnCode item in the runtime block for the Funcotate task.
However, I do not know how to do that. If you think the issue is the same, can you please let me know where and what to insert into my WDL, so that I can copy-paste it? I pasted the entire WDL as I have it now below.
Thanks!
Mia
# Run Funcotator on a set of called variants.
#
# Description of inputs:
#
#   Required:
#     String gatk_docker - GATK Docker image in which to run
#     File ref_fasta - Reference FASTA file.
#     File ref_fasta_index - Reference FASTA file index.
#     File ref_fasta_dict - Reference FASTA file sequence dictionary.
#     File variant_vcf_to_funcotate - Variant Context File (VCF) containing the variants to annotate.
#     File variant_vcf_to_funcotate_index - Index file corresponding to the input Variant Context File (VCF) containing the variants to annotate.
#     String reference_version - Version of the reference being used. Either `hg19` or `hg38`.
#     String output_file_name - Path to desired output file.
#     String output_format - Output file format (either VCF or MAF).
#     Boolean compress - Whether to compress the resulting output file.
#     Boolean use_gnomad - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
#
#   Optional:
#     File? interval_list - Intervals to be used for traversal. If specified will only traverse the given intervals.
#     File? data_sources_tar_gz - Path to tar.gz containing the data sources for Funcotator to create annotations.
#     String? transcript_selection_mode - Method of detailed transcript selection. This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
#     Array[String]? transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
#     Array[String]? annotation_defaults - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>). This will add the specified annotation to every annotated variant if it is not already present.
#     Array[String]? annotation_overrides - Override values for annotations (in the format <ANNOTATION>:<VALUE>). Replaces existing annotations of the given name with given values.
#     File? gatk4_jar_override - Override Jar file containing GATK 4.0. Use this when overriding the docker JAR or when using a backend without docker.
#     String? funcotator_extra_args - Extra command-line arguments to pass through to Funcotator. (e.g. " --exclude-field foo_field --exclude-field bar_field ")
#
# This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location. As of cromwell-0.24,
# this logic *must* go into each task. Therefore, there is a lot of duplicated code. This allows users to specify a jar file
# independent of what is in the docker file. See the README.md for more info.
#
workflow Funcotator {
    String gatk_docker
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    File variant_vcf_to_funcotate
    File variant_vcf_to_funcotate_index
    String reference_version
    String output_file_base_name
    String output_format
    Boolean compress
    Boolean use_gnomad

    File? interval_list
    File? data_sources_tar_gz
    String? transcript_selection_mode
    Array[String]? transcript_selection_list
    Array[String]? annotation_defaults
    Array[String]? annotation_overrides
    String? funcotator_extra_args
    File? gatk4_jar_override

    call Funcotate {
        input:
            gatk_docker = gatk_docker,
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            ref_dict = ref_dict,
            input_vcf = variant_vcf_to_funcotate,
            input_vcf_idx = variant_vcf_to_funcotate_index,
            reference_version = reference_version,
            output_file_base_name = output_file_base_name,
            output_format = output_format,
            compress = compress,
            use_gnomad = use_gnomad,
            interval_list = interval_list,
            data_sources_tar_gz = data_sources_tar_gz,
            transcript_selection_mode = transcript_selection_mode,
            transcript_selection_list = transcript_selection_list,
            annotation_defaults = annotation_defaults,
            annotation_overrides = annotation_overrides,
            extra_args = funcotator_extra_args,
            gatk_override = gatk4_jar_override
    }

    output {
        File funcotated_file_out = Funcotate.funcotated_output_file
        File funcotated_file_out_idx = Funcotate.funcotated_output_file_index
    }
}

################################################################################

task Funcotate {

    # ==============
    # Inputs
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    File input_vcf
    File input_vcf_idx
    String reference_version
    String output_file_base_name
    String output_format
    Boolean compress
    Boolean use_gnomad

    # This should be updated when a new version of the data sources is released
    # TODO: Make this dynamically chosen in the command.
    File? data_sources_tar_gz = "gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz"

    String? control_id
    String? case_id
    String? sequencing_center
    String? sequence_source
    String? transcript_selection_mode
    File? transcript_selection_list
    Array[String]? annotation_defaults
    Array[String]? annotation_overrides
    Array[String]? funcotator_excluded_fields
    Boolean? filter_funcotations
    File? interval_list
    String? extra_args

    # ==============
    # Process input args:
    String output_maf = output_file_base_name + ".maf"
    String output_maf_index = output_maf + ".idx"
    String output_vcf = output_file_base_name + if compress then ".vcf.gz" else ".vcf"
    String output_vcf_idx = output_vcf + if compress then ".tbi" else ".idx"
    String output_file = if output_format == "MAF" then output_maf else output_vcf
    String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_idx
    String transcript_selection_arg = if defined(transcript_selection_list) then " --transcript-list " else ""
    String annotation_def_arg = if defined(annotation_defaults) then " --annotation-default " else ""
    String annotation_over_arg = if defined(annotation_overrides) then " --annotation-override " else ""
    String filter_funcotations_args = if defined(filter_funcotations) && (filter_funcotations) then " --remove-filtered-variants " else ""
    String excluded_fields_args = if defined(funcotator_excluded_fields) then " --exclude-field " else ""
    String interval_list_arg = if defined(interval_list) then " -L " else ""
    String extra_args_arg = select_first([extra_args, ""])

    # ==============
    # Runtime options:
    String gatk_docker
    File? gatk_override
    Int? mem
    Int? preemptible_attempts
    Int? max_retries
    Int? disk_space_gb
    Int? cpu

    Boolean use_ssd = false

    # Mem is in units of GB but our command and memory runtime values are in MB
    Int default_ram_mb = 1024 * 3
    Int machine_mem = if defined(mem) then mem *1024 else default_ram_mb
    Int command_mem = machine_mem - 1024

    # Calculate disk size:
    Float ref_size_gb = size(ref_fasta, "GiB") + size(ref_fasta_index, "GiB") + size(ref_dict, "GiB")
    Float vcf_size_gb = size(input_vcf, "GiB") + size(input_vcf_idx, "GiB")
    Float ds_size_gb = size(data_sources_tar_gz, "GiB")
    Int default_disk_space_gb = ceil( ref_size_gb + (ds_size_gb * 2) + (vcf_size_gb * 10) ) + 20

    # Silly hack to allow us to use the dollar sign in the command section:
    String dollar = "$"

    command <<<
        set -e
        export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

        # =======================================
        # Hack to validate our WDL inputs:
        #
        # NOTE: This happens here so that we don't waste time copying down the data sources if there's an error.

        if [[ "${output_format}" != "MAF" ]] && [[ "${output_format}" != "VCF" ]] ; then
            echo "ERROR: Output format must be MAF or VCF."
        fi

        # =======================================
        # Handle our data sources:

        # Extract the tar.gz:
        echo "Extracting data sources tar/gzip file..."
        mkdir datasources_dir
        tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
        DATA_SOURCES_FOLDER="$PWD/datasources_dir"

        # Handle gnomAD:
        if ${use_gnomad} ; then
            echo "Enabling gnomAD..."
for potential_gnomad_gz in gnomAD_exome.tar.gz gnomAD_genome.tar.gz ; do if [[ -f ${dollar}{DATA_SOURCES_FOLDER}/${dollar}{potential_gnomad_gz} ]] ; then cd ${dollar}{DATA_SOURCES_FOLDER} tar -zvxf ${dollar}{potential_gnomad_gz} cd - else echo "ERROR: Cannot find gnomAD folder: ${dollar}{potential_gnomad_gz}" 1>&2 false fi done fi # ======================================= # Run Funcotator: gatk --java-options "-Xmx${command_mem}m" Funcotator \ --data-sources-path $DATA_SOURCES_FOLDER \ --ref-version ${reference_version} \ --output-file-format ${output_format} \ -R ${ref_fasta} \ -V ${input_vcf} \ -O ${output_file} \ ${interval_list_arg} ${default="" interval_list} \ --annotation-default normal_barcode:${default="Unknown" control_id} \ --annotation-default tumor_barcode:${default="Unknown" case_id} \ --annotation-default Center:${default="Unknown" sequencing_center} \ --annotation-default source:${default="Unknown" sequence_source} \ ${"--transcript-selection-mode " + transcript_selection_mode} \ ${transcript_selection_arg}${default="" sep=" --transcript-list " transcript_selection_list} \ ${annotation_def_arg}${default="" sep=" --annotation-default " annotation_defaults} \ ${annotation_over_arg}${default="" sep=" --annotation-override " annotation_overrides} \ ${excluded_fields_args}${default="" sep=" --exclude-field " funcotator_excluded_fields} \ ${filter_funcotations_args} \ ${extra_args_arg} # ======================================= # Make sure we have a placeholder index for MAF files so this workflow doesn't fail: if [[ "${output_format}" == "MAF" ]] ; then touch ${output_maf_index} fi >>> runtime { docker: gatk_docker bootDiskSizeGb: 20 memory: machine_mem + " MB" disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD" preemptible: select_first([preemptible_attempts, 3]) maxRetries: select_first([max_retries, 0]) cpu: select_first([cpu, 1]) } output { File funcotated_output_file = "${output_file}" File 
funcotated_output_file_index = "${output_file_index}" } }
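For anyone checking the resource settings, the memory and disk arithmetic in the Funcotate task above can be worked through by hand. The following is only a rough Python sketch of those WDL expressions, not part of the workflow; the function name `funcotate_resources` and the file sizes in the example are made up for illustration.

```python
import math


def funcotate_resources(ref_gib, vcf_gib, data_sources_gib, mem_gb=None):
    """Mirror the sizing expressions of the Funcotate task (sizes in GiB).

    Hypothetical helper for illustration only; the names follow the WDL
    variables but this function does not exist in GATK or Cromwell.
    """
    default_ram_mb = 1024 * 3
    # "mem" is given in GB, while the runtime and -Xmx values are in MB.
    machine_mem_mb = mem_gb * 1024 if mem_gb is not None else default_ram_mb
    # The Java heap (-Xmx) is set 1024 MB below the machine memory.
    command_mem_mb = machine_mem_mb - 1024
    # Data sources are counted twice and the VCF ten times, plus 20 GB.
    default_disk_gb = math.ceil(ref_gib + data_sources_gib * 2 + vcf_gib * 10) + 20
    return {
        "machine_mem_mb": machine_mem_mb,
        "command_mem_mb": command_mem_mb,
        "default_disk_gb": default_disk_gb,
    }


# Made-up sizes: 3.25 GiB of reference files, a 0.5 GiB VCF plus index,
# and a 14.5 GiB data sources tarball, with "mem" left unset.
print(funcotate_resources(3.25, 0.5, 14.5))
# With mem = 12 the machine gets 12288 MB and the Java heap 11264 MB.
print(funcotate_resources(3.25, 0.5, 14.5, mem_gb=12))
```

With the default of 3 GB, Funcotator itself only gets a 2048 MB heap, so setting `mem` explicitly may be worth considering for large inputs.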
-
Hi Mia,
I'm seeing this in your .log file:
23:37:25.124 INFO Funcotator - Shutting down engine
[August 28, 2020 11:37:25 PM UTC] org.broadinstitute.hellbender.tools.funcotator.Funcotator done. Elapsed time: 197.71 minutes.
Runtime.totalMemory()=11615600640
java.lang.StringIndexOutOfBoundsException: String index out of range: 218
at java.lang.String.substring(String.java:1963)
at org.broadinstitute.hellbender.tools.funcotator.ProteinChangeInfo.initializeForInsertion(ProteinChangeInfo.java:256)
at org.broadinstitute.hellbender.tools.funcotator.ProteinChangeInfo.<init>(ProteinChangeInfo.java:93)
at org.broadinstitute.hellbender.tools.funcotator.ProteinChangeInfo.create(ProteinChangeInfo.java:371)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createSequenceComparison(GencodeFuncotationFactory.java:2010)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createCodingRegionFuncotationForProteinCodingFeature(GencodeFuncotationFactory.java:1200)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createExonFuncotation(GencodeFuncotationFactory.java:1051)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createGencodeFuncotationOnSingleTranscript(GencodeFuncotationFactory.java:985)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createFuncotationsHelper(GencodeFuncotationFactory.java:812)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createFuncotationsHelper(GencodeFuncotationFactory.java:796)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.lambda$createGencodeFuncotationsByAllTranscripts$0(GencodeFuncotationFactory.java:473)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createGencodeFuncotationsByAllTranscripts(GencodeFuncotationFactory.java:474)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createFuncotationsOnVariant(GencodeFuncotationFactory.java:529)
at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.determineFuncotations(DataSourceFuncotationFactory.java:233)
at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.createFuncotations(DataSourceFuncotationFactory.java:201)
at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.createFuncotations(DataSourceFuncotationFactory.java:172)
at org.broadinstitute.hellbender.tools.funcotator.FuncotatorEngine.lambda$createFuncotationMapForVariant$0(FuncotatorEngine.java:147)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.funcotator.FuncotatorEngine.createFuncotationMapForVariant(FuncotatorEngine.java:157)
at org.broadinstitute.hellbender.tools.funcotator.Funcotator.enqueueAndHandleVariant(Funcotator.java:903)
at org.broadinstitute.hellbender.tools.funcotator.Funcotator.apply(Funcotator.java:857)
at org.broadinstitute.hellbender.engine.VariantWalker.lambda$traverse$0(VariantWalker.java:104)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
You may already be aware of this, but I wanted to point it out in case this is a genuine error you want to check for rather than bypassing the error code. I did a search for "gatk funcotator java.lang.StringIndexOutOfBoundsException" and found a couple of GATK forum posts where others seem to have run into similar issues.
The latter is by the user who also asked for help with setting up the continueOnReturnCode runtime element in their WDL. If you are certain you want to have this added to your WDL, I believe you will only need to add it to the runtime block for your Funcotator task. The user had to do a bit more because they had a default runtime that was provided to the task, but based on the WDL you shared, you should be able to just add the runtime attribute. See: https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#continueonreturncode
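To make the effect of that attribute concrete, here is a small Python model of how, as I read the Cromwell documentation linked above, a task's return code is judged against continueOnReturnCode. It is a sketch of the documented behaviour under that assumption, not Cromwell's actual code, and the function name `is_successful` is hypothetical.

```python
def is_successful(return_code, continue_on_return_code=False):
    """Rough model of the documented continueOnReturnCode semantics.

    Assumption (from the Cromwell docs, not from its source code):
    false -> only 0 succeeds; true -> every code succeeds;
    an integer or a list of integers -> only those codes succeed.
    """
    # bool is a subclass of int in Python, so it must be tested first.
    if isinstance(continue_on_return_code, bool):
        return True if continue_on_return_code else return_code == 0
    if isinstance(continue_on_return_code, int):
        return return_code == continue_on_return_code
    return return_code in continue_on_return_code


# Default behaviour: a return code of 1 fails the task.
print(is_successful(1))
# With [0, 1], return codes 0 and 1 both count as success, but 2 does not.
print(is_successful(1, [0, 1]), is_successful(2, [0, 1]))
```

Keep in mind that accepting return code 1 only changes how the exit status is judged. If Funcotator stopped partway through, the output file may be truncated, so it is worth checking it afterwards. In the WDL itself, the attribute goes in the task's runtime block.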
For example:
runtime {
    docker: gatk_docker
    bootDiskSizeGb: 20
    memory: machine_mem + " MB"
    disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
    preemptible: select_first([preemptible_attempts, 3])
    maxRetries: select_first([max_retries, 0])
    cpu: select_first([cpu, 1])
    continueOnReturnCode: [0, 1]
}
Let me know if you have any further questions or concerns.
Kind regards,
Jason
-
Hi Jason,
Just an update: I removed the new variant that was failing, and that seems to have worked.
So overall, two variants failed for different reasons. It is not ideal, but it is OK.
Of course, it would be great if this could be fixed in the future.
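In case it helps anyone else who hits this, removing individual records from an uncompressed VCF before annotation can be done roughly as below. This is a minimal sketch: the function name `drop_variants` and the positions in the example are hypothetical placeholders, not the actual variants that failed in my run.

```python
def drop_variants(vcf_lines, positions_to_drop):
    """Yield VCF lines, skipping records whose (CHROM, POS) is listed.

    Header lines (starting with '#') are always passed through unchanged.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        # CHROM and POS are the first two tab-separated columns of a record.
        fields = line.split("\t", 2)
        if (fields[0], int(fields[1])) in positions_to_drop:
            continue
        yield line


# Tiny in-memory example with made-up records.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t200\t.\tG\tC\t.\tPASS\t.\n",
]
kept = list(drop_variants(example, {("chr1", 200)}))
print("".join(kept), end="")
```

For a real file the lines would come from `open("input.vcf")` and be written to a new file, which then needs a fresh index before it is passed to Funcotator. A bgzipped VCF would have to be decompressed first. Excluding the positions with the GATK `-XL` (exclude intervals) argument should be another way to get the same result without editing the file.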
Thank you for helping and best wishes,
Mia
-
Hi Mia,
I'm glad to hear you were able to get it working after removing the two variants. The GitHub issues section will be the best place to keep track of work done toward solving these bugs.
If there's anything else you're seeing that looks like a bug for investigation, let us know and we'll be happy to follow up with the author(s).
Kind regards,
Jason