Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Running GATK Funcotator on a cluster

10 comments

  • Amit

    Hi Bhanu,

    Gentle reminder.

Could you please reply to my query above? Let me know if you need more information.

    Amit

  • Amit

    Hi GATK Development Team,

    When do you think I can expect to hear from you on my query? 

    Amit

  • Bhanu Gandham

Hi Amit,

    Apologies for the delay in getting back to you. We have been facing a huge volume of questions these last couple of months, which is why there have been delays on the forum.

    I am looking into this issue now and will get back to you shortly.

  • Bhanu Gandham

Hi Amit,

    It looks like you mismatched the reference and data source versions:

    gatk --java-options "-Xmx8g -Xmx4g" Funcotator \
      -R UCSCWholeGenomeFastaHG19/genome.fa \
      -V inputfile.vcf \
      -O outputfile.vcf \
      --output-file-format VCF \
      --data-sources-path dataSourcesFolderHG19/ \
      --ref-version hg38

    Let us know if the issue persists after making the correction, and if it does, please also provide the exact command you are using to run this locally.

  • Amit

Hi Bhanu, sorry about the confusion caused by the incorrect reference information in my command.

    This command runs inside a Snakemake workflow, where the values for the -R, --data-sources-path, and --ref-version options are imported from a config file. I made a mistake while writing the values out by hand in my original query. However, I have double-checked the config file, and it contains the correct values. So please consider the command below.

    gatk --java-options "-Xmx8g -Xmx4g" Funcotator \
      -R UCSCWholeGenomeFastaHG38/hg38.fa \
      -V inputfile.vcf \
      -O outputfile.vcf \
      --output-file-format VCF \
      --data-sources-path dataSourcesFolderHG38/ \
      --ref-version hg38

Also, please note that this happens only when I submit jobs to compute nodes on the HPC cluster. The command runs fine on the cluster's login node and locally on a Linux machine. Do let me know your thoughts.
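For context, a rough sketch of how the config-driven invocation Amit describes might look once the workflow expands it (the CONFIG_* variables are hypothetical stand-ins for the values Snakemake imports from the config file):

```shell
# Hypothetical stand-ins for the values imported from the Snakemake config file.
CONFIG_REF="UCSCWholeGenomeFastaHG38/hg38.fa"
CONFIG_DS="dataSourcesFolderHG38/"
CONFIG_REF_VERSION="hg38"

# The expanded command the workflow would run on a compute node.
gatk --java-options "-Xmx8g" Funcotator \
  -R "$CONFIG_REF" \
  -V inputfile.vcf \
  -O outputfile.vcf \
  --output-file-format VCF \
  --data-sources-path "$CONFIG_DS" \
  --ref-version "$CONFIG_REF_VERSION"
```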

  • Bhanu Gandham

Hi Amit,

    1. Is the log file shared above the entire log? If not, please share the entire log file.
    2. Can you please run with the stacktrace enabled and share the log with us?
    3. Have you enabled the gnomAD data sources? Funcotator doesn't try to access the cloud unless gnomAD is enabled. If it is enabled, that might be the problem, and it may have to do with the way your HPC is set up.
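For reference, GATK can print stack traces for user exceptions via a Java system property passed through --java-options; a sketch, reusing the hg38 paths from Amit's command above:

```shell
# GATK_STACKTRACE_ON_USER_EXCEPTION is a real GATK Java property; combine it
# with the usual memory flags inside the same quoted --java-options string.
gatk --java-options "-Xmx8g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" Funcotator \
  -R UCSCWholeGenomeFastaHG38/hg38.fa \
  -V inputfile.vcf \
  -O outputfile.vcf \
  --output-file-format VCF \
  --data-sources-path dataSourcesFolderHG38/ \
  --ref-version hg38
```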
     
     
  • Amit

     

Hi Bhanu. Please see my responses inline below.

    • The log file shared above, is that the entire log? If not, please share the entire log file.

             - That is the entire log. The analysis gets stuck at that point (as it tries to access the cloud-based data sources) and times out!

    • Can you please run with the stacktrace enabled and share the log with us?

            - I will work on this and get back to you.

    • Have you enabled the gnomAD data sources? Funcotator doesn't try to access the cloud unless gnomAD is enabled. If it is enabled, that might be the problem, and it may have to do with the way your HPC is set up.

             - Funcotator runs fine on the login node of the HPC (and locally on a Linux machine), which suggests the data sources are enabled, unless there is a different way to enable them for runs on HPC compute nodes. If there is, can you please let me know how?

            - I have been working with the HPC team, and it doesn't look like there are any issues on their end.

    Please let me know your thoughts.
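For reference on the gnomAD point: in the Funcotator data source bundles, gnomAD ships disabled and is enabled by extracting the bundled archives inside the data sources directory. A sketch, with directory and archive names that are illustrative of the bundle layout rather than exact:

```shell
# gnomAD becomes "enabled" once these archives are extracted inside the data
# sources directory (paths are hypothetical). Note that the gnomAD data
# sources reference files hosted in a Google Cloud bucket, so the node
# running Funcotator then needs outbound network access.
cd dataSourcesFolderHG38/
tar -zxf gnomAD_exome.tar.gz
tar -zxf gnomAD_genome.tar.gz
```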

     

  • Amit

    Hi Bhanu,

    Here is a quick update. 

• Can you please run with the stacktrace enabled and share the log with us?

            - The log stays the same and the process times out!

    Look forward to your thoughts.

  • Bhanu Gandham

Hi Amit,

    Can you tell me whether the data sources are copied to each node, or whether there is one central copy that all nodes access? It seems to be hanging where it tries to read in the data sources. We think it might be an NFS / network-storage issue.
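If network storage is the culprit, one way to test that hypothesis (a sketch, assuming the scheduler provides a node-local scratch directory such as $TMPDIR; paths mirror the hg38 command above and are otherwise hypothetical) is to stage the data sources onto local disk before running:

```shell
# Copy the central data sources to node-local scratch, then point Funcotator
# at the local copy to take NFS out of the picture.
LOCAL_DS="${TMPDIR:-/tmp}/funcotator_ds"
mkdir -p "$LOCAL_DS"
rsync -a dataSourcesFolderHG38/ "$LOCAL_DS/"

gatk --java-options "-Xmx8g" Funcotator \
  -R UCSCWholeGenomeFastaHG38/hg38.fa \
  -V inputfile.vcf \
  -O outputfile.vcf \
  --output-file-format VCF \
  --data-sources-path "$LOCAL_DS/" \
  --ref-version hg38
```

If the run completes against the local copy but hangs against the central one, that points at the shared filesystem rather than at Funcotator itself.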
  • Amit

The data sources are in a central location that all nodes access. I will check with the HPC team again.

    Thank you!

