Running GATK Funcotator on a cluster
Hello Bhanu,
I am creating a new post at your request, and I am including some additional details that could be helpful in finding a solution.
I am trying to run a Snakemake workflow using GATK 4.1.0 on an HPC cluster. HaplotypeCaller and Mutect2 run fine; however, Funcotator does not run to completion and times out. Interestingly, Funcotator runs fine when I run it on the HPC login node. I suspect this has something to do with the cloud-hosted data sources that Funcotator is trying to access.
Could you please share your thoughts? Below are commands used and error logs.
Command used (a Slurm script runs a pipeline with this command on the HPC cluster)
gatk --java-options -Xmx8g -Xmx4g Funcotator \
-R UCSCWholeGenomeFastaHG19/genome.fa \
-V inputfile,vcf \
-O outputfile.vcf \
--output-file-format VCF \
--data-sources-path dataSourcesFolderHG19/ \
--ref-version hg38
Error Log
20:17:09.137 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/sw/med/centos7/gatk/4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
20:17:10.849 INFO Funcotator - ------------------------------------------------------------
20:17:10.849 INFO Funcotator - The Genome Analysis Toolkit (GATK) v4.1.0.0
20:17:10.849 INFO Funcotator - For support and documentation go to https://software.broadinstitute.org/gatk/
20:17:10.849 INFO Funcotator - Executing as XYZ on Linux v3.10.0-1062.9.1.el7.x86_64 amd64
20:17:10.849 INFO Funcotator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_232-b09
20:17:10.850 INFO Funcotator - Start Date/Time: May 26, 2020 8:17:09 PM EDT
20:17:10.850 INFO Funcotator - ------------------------------------------------------------
20:17:10.850 INFO Funcotator - ------------------------------------------------------------
20:17:10.850 INFO Funcotator - HTSJDK Version: 2.18.2
20:17:10.850 INFO Funcotator - Picard Version: 2.18.25
20:17:10.850 INFO Funcotator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
20:17:10.850 INFO Funcotator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
20:17:10.850 INFO Funcotator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
20:17:10.850 INFO Funcotator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
20:17:10.850 INFO Funcotator - Deflater: IntelDeflater
20:17:10.850 INFO Funcotator - Inflater: IntelInflater
20:17:10.850 INFO Funcotator - GCS max retries/reopens: 20
20:17:10.850 INFO Funcotator - Requester pays: disabled
20:17:10.850 INFO Funcotator - Initializing engine
20:17:11.232 INFO FeatureManager - Using codec VCFCodec to read file file:///CallVars/VCF/A_somatic.vcf
20:17:11.281 INFO Funcotator - Done initializing engine
20:17:11.281 INFO Funcotator - Validating Sequence Dictionaries...
20:17:11.286 INFO Funcotator - Processing user transcripts/defaults/overrides...
20:17:11.287 INFO Funcotator - Initializing data sources...
20:17:11.288 INFO DataSourceUtils - Initializing data sources from directory: dataSourcesFolderHG38/
20:17:11.289 WARN DataSourceUtils - Could not read MANIFEST.txt: unable to log data sources version information.
20:17:11.294 INFO DataSourceUtils - Resolved data source file path: file:///CallVars/clinvar_20180429_hg38.vcf -> file:///dataSourcesFolderHG38/clinvar/hg38/clinvar_20180429_hg38.vcf
20:17:11.296 INFO DataSourceUtils - Resolved data source file path: file:///CallVars/gencode.v28.annotation.REORDERED.gtf -> file:///CallVars/dataSourcesFolderHG38/gencode/hg38/gencode.v28.annotation.REORDERED.gtf
20:17:11.297 INFO DataSourceUtils - Resolved data source file path: file:///CallVars/gencode.v28.pc_transcripts.fa -> file:///CallVars/dataSourcesFolderHG38/gencode/hg38/gencode.v28.pc_transcripts.fa
-
Hi Bhanu,
Gentle reminder.
Could you please reply to my query above? Let me know if you need more information.
Amit
-
Hi GATK Development Team,
When do you think I can expect to hear from you on my query?
Amit
-
Hi Amit
Apologies for the delay in getting back to you. We have been facing a huge volume of questions these last couple of months, which is why there have been delays on the forum.
I am looking into this issue now and will get back to you shortly.
-
Hi Amit
It looks like the reference and data source versions in your command are mismatched:
gatk --java-options -Xmx8g -Xmx4g Funcotator \
-R UCSCWholeGenomeFastaHG19/genome.fa \
-V inputfile,vcf \
-O outputfile.vcf \
--output-file-format VCF \
--data-sources-path dataSourcesFolderHG19/ \
--ref-version hg38
Let us know if the issue persists after making the correction. If it does, please also provide the exact command you are using to run this locally.
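A mismatch of this kind can be caught before a job is submitted. The following is only a minimal sketch: it assumes the data sources folder follows the layout visible in the log above (one folder per data source, each containing a subfolder named after the reference version, e.g. clinvar/hg38/). The function name missing_ref_version_dirs is hypothetical and not part of GATK.

```python
from pathlib import Path


def missing_ref_version_dirs(data_sources_dir, ref_version):
    """Return the names of data source folders lacking a <ref_version> subfolder.

    Assumes the layout <data_sources_dir>/<source>/<ref_version>/, as seen in
    the log paths (e.g. dataSourcesFolderHG38/clinvar/hg38/).
    """
    root = Path(data_sources_dir)
    missing = []
    # Only directories are treated as data sources; loose files are ignored.
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (source_dir / ref_version).is_dir():
            missing.append(source_dir.name)
    return missing
```

Calling missing_ref_version_dirs("dataSourcesFolderHG38/", "hg38") with the values from the workflow config would return an empty list when every data source provides an hg38 folder, and the names of the offending folders otherwise.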
-
Hi Bhanu, sorry about the confusion caused by the incorrect reference information in my command.
This command is run within a Snakemake workflow where the values for the options (-R, --data-sources-path and --ref-version) are imported from a config file. I made a mistake while manually writing the values into the command in my original query. However, I have double-checked the config file and it has the correct values. So, please consider the command below.
gatk --java-options -Xmx8g -Xmx4g Funcotator \
-R UCSCWholeGenomeFastaHG38/hg38.fa \
-V inputfile.vcf \
-O outputfile.vcf \
--output-file-format VCF \
--data-sources-path dataSourcesFolderHG38/ \
--ref-version hg38
Also, please note that this happens only when I submit jobs to compute nodes on the HPC cluster. This command runs fine on the login node of the cluster or locally on a Linux machine. Do let me know your thoughts.
-
Hi Amit
- Is the log file shared above the entire log? If not, please share the entire log file.
- Can you please run with stacktrace enabled and share the log with us?
- Have you enabled the gnomAD data sources? Funcotator doesn't try to access the cloud unless gnomAD is enabled. If it is enabled, then that might be the problem, and it might have to do with the way your HPC is set up.
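Whether gnomAD is enabled can be checked by looking at the data sources folder itself. The sketch below is hedged: it assumes the pre-packaged Funcotator data sources, where gnomAD ships as .tar.gz archives and counts as enabled only once an archive has been extracted into a folder. The function name gnomad_status and the "gnomad" name prefix match are illustrative assumptions, not GATK behaviour.

```python
from pathlib import Path


def gnomad_status(data_sources_dir):
    """Report gnomAD entries in a Funcotator data sources folder.

    Assumption: an extracted gnomAD folder means the source is enabled,
    while a leftover .tar.gz archive alone means it is not.
    """
    root = Path(data_sources_dir)
    # Case-insensitive prefix match on the entry name (assumed naming).
    entries = [p for p in root.iterdir() if p.name.lower().startswith("gnomad")]
    extracted = sorted(p.name for p in entries if p.is_dir())
    archived = sorted(
        p.name for p in entries if p.is_file() and p.name.endswith(".tar.gz")
    )
    return {"extracted": extracted, "archived": archived}
```

If "extracted" comes back empty for the folder passed to --data-sources-path, gnomAD is most likely not enabled, and cloud access would be a less likely explanation for the hang.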
-
Hi Bhanu. Please see my comments below each question.
- The log file shared above, is that the entire log? If not, please share the entire log file.
- That is the entire log. The analysis gets stuck at that point (as it tries to access cloud-based data sources) and times out!
- Can you please run with stacktrace enabled and share the log with us?
- I will work on this and get back to you.
- Have you enabled the gnomAD data sources? Funcotator doesn't try to access the cloud unless gnomAD is enabled. If it is enabled, then that might be the problem, and it might have to do with the way your HPC is set up.
- Funcotator works on the HPC login node (and locally on a Linux machine), which suggests the data sources are enabled, unless there is a different way to enable them for HPC compute nodes. If there is, could you please let me know how?
- I have been working with the HPC team, and it doesn't look like there are any issues on their end.
Please let me know your thoughts.
-
Hi Bhanu,
Here is a quick update.
- Can you please run with stacktrace enabled and share the log with us?
- Log stays the same and the process times out!
Look forward to your thoughts.
-
Hi Amit
Can you tell me whether the data sources are copied to each node or whether there is one central copy that all nodes are trying to access? It seems to be hanging where it tries to read in the data sources. We think it might be an NFS / network storage related issue.
-
The data sources are in a central location that all nodes are trying to access. I will check with the HPC team again.
Thank you!
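One way to gather evidence for or against the network storage theory is to time a plain read of one of the data source files on the login node and again inside a job on a compute node. This is only a rough sketch; time_read is a hypothetical helper, and a repeated read on the same machine may be served from the operating system's cache and look faster than the storage really is.

```python
import time


def time_read(path, chunk_size=1024 * 1024):
    """Read a file from start to end and return (bytes_read, seconds_elapsed)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty chunk signals end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Running it on, for example, the gencode GTF listed in the log from both kinds of node and comparing the elapsed times would show whether reads from the central copy are much slower, or stall, on the compute nodes.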