Genome Analysis Toolkit

How to create a Panel of Normals?

  • Pamela Bretscher

    Hi Tanay Biswas,

    Thank you for writing in. From the program log that you shared, I'm not actually seeing any error. Is this the full stack trace from running CreateSomaticPanelOfNormals? The warning that you see about the tool being a BETA tool can be ignored.

    Kind regards,

    Pamela

  • Tanay Biswas

    Hi Pamela,

    Thanks for the reply. Yes, this is the full trace from running the command, but it did not generate any output file such as pon.vcf.gz. If there is no error, what could be preventing the panel of normals from being created? Could there have been a problem while generating pon_db in the previous step? In that step the run stopped and showed the following:

    terminate called after throwing an instance of 'GenomicsDBConfigException'
      what():  GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_3544340605349532363.json
    Aborted (core dumped)

    Is there anything I should do?

    Regards,

    Tanay

  • Pamela Bretscher

    Hi Tanay Biswas,

    Based on this output from CreateSomaticPanelOfNormals, it looks like it might not have finished running. If you haven't received any error or output, is it possible that the job is still running? From the previous step, did you receive the output pon_db, or did the run stop before it was generated? If the run stopped before generating it, I would recommend running the previous step with more memory.
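
    One quick way to answer that question is to look inside the workspace directory before running CreateSomaticPanelOfNormals. The sketch below is only a rough check: the three file names are the ones the GenomicsDBImport log in this thread says it writes, their presence is necessary but does not prove the import completed, and the workspace path and function name are illustrative assumptions.

    ```shell
    # Check that a GenomicsDB workspace contains the metadata files named in the import log.
    # Presence of these files does NOT guarantee that the import finished successfully.
    workspace_looks_present() {
      ws=$1
      for f in callset.json vidmap.json vcfheader.vcf; do
        [ -f "$ws/$f" ] || { echo "missing: $ws/$f"; return 1; }
      done
      echo "found metadata files in $ws"
    }

    # Only print the panel-of-normals command if the workspace passes the check.
    WS=/scratch/tbiswas/pon_db2   # workspace path from the earlier import command
    if workspace_looks_present "$WS"; then
      echo java -jar gatk-package-4.2.5.0-local.jar CreateSomaticPanelOfNormals \
        -R /home/tbiswas/hg19.fa -V "gendb://$WS" -O pon.vcf.gz
    fi
    ```

    If the check reports a missing file, the import step most likely needs to be rerun before the panel of normals can be created.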

    Kind regards,

    Pamela

  • Tanay Biswas

    Hi Pamela,

    I don't see any job running; the command finished almost immediately. For the previous step I ran the following command:

    [tbiswas@un04 ~]$ java -jar /home/tbiswas/softwares/gatk-4.2.5.0/
    gatk-package-4.2.5.0-local.jar GenomicsDBImport
    -R /home/tbiswas/hg19.fa --genomicsdb-workspace-path
    /scratch/tbiswas/pon_db2 --batch-size 1
    -L /home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed
    -V /home/tbiswas/gatk_output/P4-BD.vcf.gz -V /home/tbiswas/gatk_output/P5-BD.vcf.gz
    -V /home/tbiswas/gatk_output/P6-BD.vcf.gz -V /home/tbiswas/gatk_output/P8-BD.vcf.gz
    -V /home/tbiswas/gatk_output/P12-BD.vcf.gz -V /home/tbiswas/gatk_output/P13-BD.vcf.gz
    -V /home/tbiswas/gatk_output/P14-BD.vcf.gz

    This command generated output, but with the following memory-related error, which I learned is acceptable if we have data for only one chromosome:

    14:02:53.727 INFO  IntervalArgumentCollection - Processing 60456963 bp from intervals
    14:02:54.039 WARN  GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
    14:02:54.643 INFO  GenomicsDBImport - Done initializing engine
    14:02:56.849 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
    14:02:57.095 INFO  GenomicsDBImport - Vid Map JSON file will be written to /scratch/tbiswas/pon_db2/vidmap.json
    14:02:57.095 INFO  GenomicsDBImport - Callset Map JSON file will be written to /scratch/tbiswas/pon_db2/callset.json
    14:02:57.095 INFO  GenomicsDBImport - Complete VCF Header will be written to /scratch/tbiswas/pon_db2/vcfheader.vcf
    14:02:57.095 INFO  GenomicsDBImport - Importing to workspace - /scratch/tbiswas/pon_db2
    17:23:06.907 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
    .
    .
    19:14:30.815 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
    terminate called after throwing an instance of 'GenomicsDBConfigException'
      what():  GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_3544340605349532363.json
    Aborted (core dumped)
    [tbiswas@un04 ~]$

    But I don't know how to make the command run without this kind of error. Please suggest what I can do to generate pon_db without any error.

    Thank you so much.

    Regards,

    Tanay

  • Pamela Bretscher

    Hi Tanay Biswas,

    Okay, thank you for providing this. It does look like the issue is the GenomicsDBImport step which isn't producing the output that you are expecting. It looks like this is a memory error with the space being used as your temporary space. Here is some information from our tool docs regarding temp space with GenomicsDBImport:

    GenomicsDBImport uses temporary disk storage during import. The amount of temporary disk storage required can exceed the space available, especially when specifying a large number of intervals. The command line argument `--tmp-dir` can be used to specify an alternate temporary storage location with sufficient space.

    We have a document regarding optimizing with GenomicsDB, you can view it here: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines
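
    As a rough illustration, a dry-run sketch that combines the `--tmp-dir` argument described above with the merge-input-intervals argument suggested by the warning in your log might look like the following. The temporary directory and workspace paths are hypothetical placeholders, the function name is made up for this example, and merging intervals is only appropriate if your data exist solely within the capture intervals.

    ```shell
    # Dry-run sketch: assemble the GenomicsDBImport command and print it for review.
    # TMP_DIR and WORKSPACE are hypothetical paths, not taken from the thread.
    GATK_JAR=/home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar
    TMP_DIR=/scratch/tbiswas/gdbi_tmp
    WORKSPACE=/scratch/tbiswas/pon_db3

    build_gdbi_cmd() {
      # Use the function's own positional parameters as the argument list.
      set -- java -Xmx8g -jar "$GATK_JAR" GenomicsDBImport \
        -R /home/tbiswas/hg19.fa \
        --genomicsdb-workspace-path "$WORKSPACE" \
        --tmp-dir "$TMP_DIR" \
        --merge-input-intervals \
        --batch-size 1 \
        -L /home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed
      # Append one -V argument per normal sample VCF.
      for sample in P4 P5 P6 P8 P12 P13 P14; do
        set -- "$@" -V "/home/tbiswas/gatk_output/${sample}-BD.vcf.gz"
      done
      echo "$*"
    }

    # Print the command; paste and run it only once the paths look right.
    build_gdbi_cmd
    ```

    Printing the command first makes it easy to confirm that the temporary directory points at a location with enough space before starting a multi-hour import.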

    Please let me know if this information is helpful for running GenomicsDBImport successfully.

    Kind regards,

    Pamela

  • Tanay Biswas

    Hi Pamela,

    Thank you for the information. I've checked the provided links and ran the following command:

    [tbiswas@un04 ~]$ java -jar -Xmx8g /home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar
    GenomicsDBImport -R /home/tbiswas/hg19.fa --genomicsdb-workspace-path new_pon_db
    --tmp-dir /home/tbiswas --batch-size 1 -L /home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed
    -V /home/tbiswas/gatk_output/P4-BD.vcf.gz -V /home/tbiswas/gatk_output/P5-BD.vcf.gz
    -V /home/tbiswas/gatk_output/P6-BD.vcf.gz -V /home/tbiswas/gatk_output/P8-BD.vcf.gz
    -V /home/tbiswas/gatk_output/P12-BD.vcf.gz -V /home/tbiswas/gatk_output/P13-BD.vcf.gz
    -V /home/tbiswas/gatk_output/P14-BD.vcf.gz
    15:53:57.761 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Mar 30, 2022 3:53:59 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    15:53:59.900 INFO  GenomicsDBImport - ------------------------------------------------------------
    15:53:59.900 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.5.0
    15:53:59.900 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    15:53:59.997 INFO  GenomicsDBImport - Executing as tbiswas@un04 on Linux v3.10.0-327.el7.x86_64 amd64
    15:53:59.998 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_65-b17
    15:53:59.998 INFO  GenomicsDBImport - Start Date/Time: 30 March, 2022 3:53:57 PM IST
    15:53:59.998 INFO  GenomicsDBImport - ------------------------------------------------------------
    15:53:59.998 INFO  GenomicsDBImport - ------------------------------------------------------------
    15:53:59.998 INFO  GenomicsDBImport - HTSJDK Version: 2.24.1
    15:53:59.999 INFO  GenomicsDBImport - Picard Version: 2.25.4
    15:53:59.999 INFO  GenomicsDBImport - Built for Spark Version: 2.4.5
    15:53:59.999 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    15:53:59.999 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:53:59.999 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    15:53:59.999 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:53:59.999 INFO  GenomicsDBImport - Deflater: IntelDeflater
    15:53:59.999 INFO  GenomicsDBImport - Inflater: IntelInflater
    15:53:59.999 INFO  GenomicsDBImport - GCS max retries/reopens: 20
    15:53:59.999 INFO  GenomicsDBImport - Requester pays: disabled
    15:53:59.999 INFO  GenomicsDBImport - Initializing engine
    15:54:08.056 INFO  FeatureManager - Using codec BEDCodec to read file file:///home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed
    15:54:29.938 INFO  IntervalArgumentCollection - Processing 60456963 bp from intervals
    15:54:30.626 WARN  GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
    15:54:31.142 INFO  GenomicsDBImport - Done initializing engine
    15:54:36.712 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
    15:54:36.811 INFO  GenomicsDBImport - Vid Map JSON file will be written to /home/tbiswas/new_pon_db/vidmap.json
    15:54:36.811 INFO  GenomicsDBImport - Callset Map JSON file will be written to /home/tbiswas/new_pon_db/callset.json
    15:54:36.811 INFO  GenomicsDBImport - Complete VCF Header will be written to /home/tbiswas/new_pon_db/vcfheader.vcf
    15:54:36.812 INFO  GenomicsDBImport - Importing to workspace - /home/tbiswas/new_pon_db
    18:33:24.251 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
    .
    .
    21:45:13.771 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
    [TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error path=new_pon_db/chr1$1646244$1646363/genomicsdb_meta_dir/genomicsdb_column_bounds.json errno=122(Disk quota exceeded)
    21:45:35.201 erro  NativeGenomicsDB - pid=40987 tid=25560 VariantStorageManagerException exception : Could not write to column bounds file
    TileDB error message : [TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error path=new_pon_db/chr1$1646244$1646363/genomicsdb_meta_dir/genomicsdb_column_bounds.json errno=122(Disk quota exceeded)
    terminate called after throwing an instance of 'VariantStorageManagerException'
      what():  VariantStorageManagerException exception : Could not write to column bounds file
    TileDB error message : [TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error path=new_pon_db/chr1$1646244$1646363/genomicsdb_meta_dir/genomicsdb_column_bounds.json errno=122(Disk quota exceeded)
    Aborted (core dumped)
    [tbiswas@un04 ~]$ 

    This generated the error above, which is different from the previous one. My disk (HPC) has 223 TB of space, and it shows that the disk is not full yet. How can I solve this issue?

    Please let me know what can be done.

  • Pamela Bretscher

    Hi Tanay Biswas,

    Thank you for taking a look at those documents and trying this out. Could you please try increasing the specified memory for the GenomicsDBImport job using the -Xmx option? Given that you have a lot more available space on your disk, you can specify quite a bit more memory for the job. Please let me know if this is successful.
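
    Separately, the log reports errno=122 (Disk quota exceeded), which on Linux refers to a per-user or per-group quota rather than a full filesystem, so it may also be worth checking both before rerunning. The sketch below is a minimal example under assumptions: the function name and the WORKSPACE_PARENT variable are hypothetical, and the `quota` tool may not be installed or may not cover your cluster's filesystem.

    ```shell
    # Report free space (in KB) on the filesystem that holds a directory.
    avail_kb() {
      df -Pk "$1" | awk 'NR==2 {print $4}'
    }

    # Hypothetical variable: the directory that will hold the workspace (defaults to the current one).
    WORKSPACE_PARENT=${WORKSPACE_PARENT:-.}
    echo "Free on filesystem: $(avail_kb "$WORKSPACE_PARENT") KB"

    # Show per-user quota if the quota tool is installed (an assumption about the cluster).
    if command -v quota >/dev/null 2>&1; then
      quota -s 2>/dev/null || true
    else
      echo "quota command not found; ask the cluster admins how to query quotas"
    fi
    ```

    If the filesystem shows plenty of free space but the quota is close to its limit, pointing the workspace at a location with a larger quota may help more than adding memory.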

    Kind regards,

    Pamela
