Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Error in running GenomicsDBConfigException

Answered

12 comments

  • Genevieve Brandt (she/her)

    Hi Shivangi Agarwal,

    Could you share your complete program log and also the GATK version number?

    Thanks,

    Genevieve

  • Shivangi Agarwal

    Hi

    GATK version is 4.2.2.0.

    Below is the command and its log.

    java -jar $GenomeAnalysisTK GenomicsDBImport -V aln_501_trimmed_again_RG.g.vcf -V aln_502_trimmed_again_RG.g.vcf --genomicsdb-workspace-path my_database --intervals chr1.bed

    16:52:22.602 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/shivangiagarwal/Downloads/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Dec 18, 2021 4:52:22 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    16:52:22.735 INFO GenomicsDBImport - ------------------------------------------------------------
    16:52:22.736 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.2.0
    16:52:22.736 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    16:52:22.736 INFO GenomicsDBImport - Executing as shivangiagarwal@gayatry-PowerEdge-T640 on Linux v5.11.0-40-generic amd64
    16:52:22.736 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v11.0.11+9-Ubuntu-0ubuntu2
    16:52:22.736 INFO GenomicsDBImport - Start Date/Time: December 18, 2021 at 4:52:22 PM CST
    16:52:22.736 INFO GenomicsDBImport - ------------------------------------------------------------
    16:52:22.736 INFO GenomicsDBImport - ------------------------------------------------------------
    16:52:22.737 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
    16:52:22.737 INFO GenomicsDBImport - Picard Version: 2.25.4
    16:52:22.737 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
    16:52:22.737 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    16:52:22.737 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    16:52:22.737 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    16:52:22.738 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    16:52:22.738 INFO GenomicsDBImport - Deflater: IntelDeflater
    16:52:22.738 INFO GenomicsDBImport - Inflater: IntelInflater
    16:52:22.738 INFO GenomicsDBImport - GCS max retries/reopens: 20
    16:52:22.738 INFO GenomicsDBImport - Requester pays: disabled
    16:52:22.738 INFO GenomicsDBImport - Initializing engine
    16:52:23.474 INFO FeatureManager - Using codec BEDCodec to read file file:///media/shivangiagarwal/DATA/PrCa/S31285117_Covered.bed
    16:52:25.738 INFO IntervalArgumentCollection - Processing 49475726 bp from intervals
    16:52:25.794 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
    16:52:25.796 INFO GenomicsDBImport - Done initializing engine
    16:52:26.128 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.1-d59e886
    16:52:26.145 INFO GenomicsDBImport - Vid Map JSON file will be written to /home/shivangiagarwal/my_database-2/vidmap.json
    16:52:26.146 INFO GenomicsDBImport - Callset Map JSON file will be written to /home/shivangiagarwal/my_database-2/callset.json
    16:52:26.146 INFO GenomicsDBImport - Complete VCF Header will be written to /home/shivangiagarwal/my_database-2/vcfheader.vcf
    16:52:26.146 INFO GenomicsDBImport - Importing to workspace - /home/shivangiagarwal/my_database-2
    17:19:32.297 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:34.502 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:36.360 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:38.525 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:40.599 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:42.947 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:44.995 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    17:19:46.992 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    [...]
    18:42:07.581 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    18:42:09.419 INFO GenomicsDBImport - Importing batch 1 with 2 samples
    terminate called after throwing an instance of 'GenomicsDBConfigException'
    what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_5890342525445890988.json
    Aborted (core dumped)

  • Genevieve Brandt (she/her)

    Hi Shivangi Agarwal,

    Thank you! I updated your comment just for readability but it is good to see the whole log!

    It's challenging to determine what could be causing the issue with the /tmp/loader_5890342525445890988.json file. Most likely there is a mix-up with your sample names (see this similar issue ticket), or your temp directory is running out of space or reaching the limit on the number of file handles that can be open at once. You can explicitly specify a temporary directory with the option --tmp-dir (see this article for more information about GenomicsDB usage).

    One other thought is that this could be a strange issue arising from how you are running the command. With GATK4, we recommend that you use the GATK wrapper script when submitting commands: https://gatk.broadinstitute.org/hc/en-us/articles/360035531892-GATK4-command-line-syntax.
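
A minimal sketch of the two suggestions above: checking free space in an explicit temp directory, then printing the import command in gatk wrapper-script form. The helper name free_kb, the ./temp location, and the 500 GiB threshold are assumptions for illustration, not GATK recommendations; the input files are the ones from the original command.

```shell
# Hypothetical helper: print the free space (in KiB) of the filesystem holding a directory.
free_kb() {
    # -P forces the portable one-line-per-filesystem format; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

TMP_DIR="${TMP_DIR:-./temp}"          # assumed location; choose a drive with plenty of space
MIN_FREE_KB=$((500 * 1024 * 1024))    # assumed threshold (500 GiB), for illustration only

# Warn if the temp directory exists but has less free space than the threshold.
if [ -d "$TMP_DIR" ] && [ "$(free_kb "$TMP_DIR")" -lt "$MIN_FREE_KB" ]; then
    echo "warning: less than 500 GiB free in $TMP_DIR" >&2
fi

# Print (rather than run) the import command, using the gatk wrapper script.
echo gatk GenomicsDBImport \
    -V aln_501_trimmed_again_RG.g.vcf \
    -V aln_502_trimmed_again_RG.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr1.bed \
    --tmp-dir "$TMP_DIR"
```

Dropping the leading echo would run the command for real, which requires gatk on the PATH and an existing temp directory.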

    Could you see what is in the /tmp/loader_5890342525445890988.json file for more information?

    Best,

    Genevieve

  • Shivangi Agarwal

    Hi,

    Yes, I can see many temporary JSON files created in the tmp directory (total size: 250 GB).

    I observed that once the disk is full, I get the error mentioned above, along with a "No disk space" pop-up.

    I am now rerunning the command with the temp directory set to a location on a storage drive with 12 TB of free space.

  • Genevieve Brandt (she/her)

    Okay, those are great observations! Let me know if it is successful. 

  • Shivangi Agarwal

    Hi Genevieve,

    I started GenomicsDBImport on Monday, but my system (125.8 GiB) froze and I eventually had to restart it.

    I then started it again on another system (256 GiB), and it is still running after more than 24 hours. I also noticed that around 2.2 TB of temporary files have been generated so far (command below).

    java -jar $GenomeAnalysisTK GenomicsDBImport -V aln_501_trimmed_again_RG.g.vcf -V aln_502_trimmed_again_RG.g.vcf --genomicsdb-workspace-path my_database --intervals S31285117_Covered.bed --tmp-dir ./temp

    So I am wondering how much time and space it requires. I have to call variants for 48 samples.

  • Genevieve Brandt (she/her)

    Yes, speeding up GenomicsDBImport is a common question. We have an article containing our recommendations here: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GenomicsDBImport-usage-and-performance-guidelines

    I also recommend that you submit your GATK command line with the gatk wrapper script: https://gatk.broadinstitute.org/hc/en-us/articles/360035531892-GATK4-command-line-syntax
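
The log earlier in this thread warns that more than 100 intervals in a single import can hurt performance and points to the merge-input-intervals argument. As a rough sketch, the interval count of a BED file can be checked before deciding whether to add that flag. The helper names count_bed_intervals and interval_args are hypothetical, and merging is only appropriate when the GVCF data exists only within those intervals, as the warning itself notes.

```shell
# Hypothetical helper: count interval records in a BED file,
# skipping blank lines and header lines (browser, track, or # comments).
count_bed_intervals() {
    awk 'NF && $1 !~ /^(#|browser|track)/ { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: print the interval arguments for GenomicsDBImport,
# adding --merge-input-intervals when the BED file has more than 100 intervals.
interval_args() {
    if [ "$(count_bed_intervals "$1")" -gt 100 ]; then
        echo "--intervals $1 --merge-input-intervals"
    else
        echo "--intervals $1"
    fi
}

# Example use with the wrapper script (not run here):
#   gatk GenomicsDBImport -V a.g.vcf -V b.g.vcf \
#       --genomicsdb-workspace-path my_database \
#       $(interval_args S31285117_Covered.bed) --tmp-dir ./temp
```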

    Please let me know if you have further questions. 

  • Shivangi Agarwal

    Hi Genevieve,

    I want to call variants for my samples, which are divided into four categories: normal, stage I, stage II, and stage III.

    My understanding so far, after going through the documentation, is that I need to create a database for the normal samples (a panel of normals, PON) (https://gatk.broadinstitute.org/hc/en-us/articles/360037058172-CreateSomaticPanelOfNormals-BETA-).

    How should I proceed to call variants comparing my normal samples to the samples from the different tumor stages?

  • Shivangi Agarwal

    Hi Genevieve,

    Could you please respond to the above query?

    Thanks

  • Genevieve Brandt (she/her)

    Hi Shivangi Agarwal,

    If all your samples are from the same individual, you can do multi-sample calling with Mutect2. This will help produce higher-quality results. There is a forum post discussion going into details about multi-sample calling that you can read for more information. You can then use the final VCF and the read depth for each sample to determine whether any variants are part of tumor progression.

    If you have 40 normal samples, you can create your own panel of normals to be used with Mutect2. If not, you can use our publicly available panel of normals created from the 1000 Genomes Project data. For best results, we recommend that you use a panel of normals with the --panel-of-normals Mutect2 argument and a corresponding normal sample for each tumor sample with the -normal argument.
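
For reference, building a custom panel of normals generally follows three steps: Mutect2 on each normal, GenomicsDBImport of the resulting VCFs, then CreateSomaticPanelOfNormals. The sketch below only prints those commands for a list of normal BAMs. The function name pon_commands, the pon_db workspace, and all file names are placeholders, so treat it as an outline under those assumptions rather than a tested pipeline.

```shell
# Hypothetical helper: print (not run) the commands for building a panel of normals.
# Usage: pon_commands REFERENCE INTERVALS NORMAL_BAM...
pon_commands() {
    ref="$1"
    intervals="$2"
    shift 2
    vcf_args=""
    for bam in "$@"; do
        # Each normal gets its own input BAM and its own output VCF.
        vcf="$(basename "$bam" .bam).vcf.gz"
        echo "gatk Mutect2 -R $ref -I $bam --max-mnp-distance 0 -O $vcf"
        vcf_args="$vcf_args -V $vcf"
    done
    echo "gatk GenomicsDBImport -R $ref -L $intervals --genomicsdb-workspace-path pon_db$vcf_args"
    echo "gatk CreateSomaticPanelOfNormals -R $ref -V gendb://pon_db -O pon.vcf.gz"
}

# Placeholder inputs; replace with real paths.
pon_commands reference.fasta intervals.bed normal1.bam normal2.bam
```

Deriving each output name from its input BAM keeps the per-normal VCFs distinct, which matters because GenomicsDBImport rejects inputs that share a sample name.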

    Please let me know if you have further questions regarding this query.

    Best,

    Genevieve

  • Shivangi Agarwal

    Hi Genevieve,

    I have a few questions regarding my GATK run setup:

    1. I have 12 normal samples. Can I make my own PON from those twelve? Is that allowed or recommended?

    2. I tried to create my own PON from the 12 samples using the commands below:

    !gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz 
    !gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal2.vcf.gz
    !gatk GenomicsDBImport -R reference.fasta -L hg19.bed --genomicsdb-workspace-path pon_db -V normal1.vcf.gz -V normal2.vcf.gz
    

    but I am getting an error like:

    "A USER ERROR has occurred: Duplicate sample: $i. Sample was found in both file://normal.vcf.gz and normal2.vcf.gz"

    How can I fix this?

    3. To run Mutect2 (command below), I will also need --germline-resource af-only-gnomad.vcf.gz. How can I get this for hg19?

        gatk Mutect2 \
         -R reference.fa \
         -I tumor.bam \
         -I normal.bam \
         -normal normal_sample_name \
         --germline-resource af-only-gnomad.vcf.gz \
         --panel-of-normals pon.vcf.gz \
         -O somatic.vcf.gz

    4. I am curious whether there is a difference between two approaches: (a) calling somatic variants separately for each sample, normal or tumor, using Mutect2 tumor-only mode, and then comparing the resulting VCF files with my own Python script to find variants in tumor vs. normal; and (b) running Mutect2 in tumor-normal mode (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2) to get the variants present in the tumor, with those present in the normal already excluded.

    Does this make any difference to the results?

    Please spare some time to answer my queries. I would really appreciate it.

    Thanks,

    Shivangi

  • Genevieve Brandt (she/her)

    Hi Shivangi,

    1. No, it's not recommended to make a PON with only 12 samples. We recommend a minimum of 40: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-. Otherwise, we recommend you use our public PON.
    2. You will need to give each of your samples a distinct name to resolve this error. Check out the tool RenameSampleInVcf.
    3. All our available resources are outlined in the resource bundle page: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
    4. Tumor-normal mode is much better than tumor-only mode. I would recommend tumor-normal mode for your main analysis, but you can always check out tumor-only if you are curious. Tumor-only will have many false positives.
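
For the duplicate-sample error in point 2, one way to see which names collide is to read the sample columns from each VCF header before importing. This is a sketch using only standard tools; the helper names vcf_samples and duplicate_samples are hypothetical. The literal $i in the reported error may mean a shell variable was not expanded when the sample name was set, though that is a guess based only on the message.

```shell
# Hypothetical helper: print the sample names from a VCF header
# (columns 10 and onward of the #CHROM line).
# gzip -cdf decompresses .gz input and passes plain-text files through unchanged.
vcf_samples() {
    gzip -cdf "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Hypothetical helper: print every sample name that appears in more than one VCF.
duplicate_samples() {
    for vcf in "$@"; do
        vcf_samples "$vcf"
    done | sort | uniq -d
}

# Example use (not run here):
#   duplicate_samples normal1.vcf.gz normal2.vcf.gz
```

Any name this prints would need to be changed in one of the files, for example with RenameSampleInVcf, before GenomicsDBImport will accept both inputs.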

    Best,

    Genevieve
