How to create Panel of Normals?
I am trying to call somatic SNVs using the following pipeline: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2#article-comments
I am getting errors while creating the panel of normals. Please let me know how I can create the PON.
a) GATK version used:
gatk-4.2.5.0/
b) Exact command used:
[tbiswas@un04 ~]$ java -jar /home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar CreateSomaticPanelOfNormals -R /home/tbiswas/hg19.fa --germline-resource /scratch/tbiswas/somatic-b37_af-only-gnomad.raw.sites.vcf -V gendb:/scratch/tbiswas/pon_db/ -O pon.vcf.gz
c) Entire program log:
[tbiswas@un04 ~]$ java -jar /home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar CreateSomaticPanelOfNormals -R /home/tbiswas/hg19.fa --germline-resource /scratch/tbiswas/somatic-b37_af-only-gnomad.raw.sites.vcf -V gendb:/scratch/tbiswas/pon_db/ -O pon.vcf.gz
12:59:11.508 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Mar 25, 2022 12:59:14 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
12:59:14.877 INFO CreateSomaticPanelOfNormals - ------------------------------------------------------------
12:59:14.878 INFO CreateSomaticPanelOfNormals - The Genome Analysis Toolkit (GATK) v4.2.5.0
12:59:14.878 INFO CreateSomaticPanelOfNormals - For support and documentation go to https://software.broadinstitute.org/gatk/
12:59:14.880 INFO CreateSomaticPanelOfNormals - Executing as tbiswas@un04 on Linux v3.10.0-327.el7.x86_64 amd64
12:59:14.880 INFO CreateSomaticPanelOfNormals - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_65-b17
12:59:14.881 INFO CreateSomaticPanelOfNormals - Start Date/Time: 25 March, 2022 12:59:11 PM IST
12:59:14.881 INFO CreateSomaticPanelOfNormals - ------------------------------------------------------------
12:59:14.881 INFO CreateSomaticPanelOfNormals - ------------------------------------------------------------
12:59:14.881 INFO CreateSomaticPanelOfNormals - HTSJDK Version: 2.24.1
12:59:14.881 INFO CreateSomaticPanelOfNormals - Picard Version: 2.25.4
12:59:14.881 INFO CreateSomaticPanelOfNormals - Built for Spark Version: 2.4.5
12:59:14.881 INFO CreateSomaticPanelOfNormals - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:59:14.881 INFO CreateSomaticPanelOfNormals - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:59:14.881 INFO CreateSomaticPanelOfNormals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:59:14.882 INFO CreateSomaticPanelOfNormals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:59:14.882 INFO CreateSomaticPanelOfNormals - Deflater: IntelDeflater
12:59:14.882 INFO CreateSomaticPanelOfNormals - Inflater: IntelInflater
12:59:14.882 INFO CreateSomaticPanelOfNormals - GCS max retries/reopens: 20
12:59:14.882 INFO CreateSomaticPanelOfNormals - Requester pays: disabled
12:59:14.882 WARN CreateSomaticPanelOfNormals -
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Warning: CreateSomaticPanelOfNormals is a BETA tool and is not yet ready for use in production
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
12:59:14.882 INFO CreateSomaticPanelOfNormals - Initializing engine
12:59:25.618 INFO FeatureManager - Using codec VCFCodec to read file file:///scratch/tbiswas/somatic-b37_af-only-gnomad.raw.sites.vcf
Thank you.
Regards,
Tanay
-
Hi Tanay Biswas,
Thank you for writing in. From the program log that you shared, I'm not actually seeing any error. Is this the full stack trace from running CreateSomaticPanelOfNormals? The warning that you see about the tool being a BETA tool can be ignored.
Kind regards,
Pamela
-
Hi Pamela
Thanks for the reply. Yes, this is the full trace after running the command. However, it did not generate any output file such as pon.vcf.gz. If there is no error, what could be the solution for creating the panel of normals? Is it possible that there was a problem while generating pon_db in the previous step? In that step the run stopped and showed the following:
terminate called after throwing an instance of 'GenomicsDBConfigException'
what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_3544340605349532363.json
Aborted (core dumped)
Is there anything I should do?
Regards,
Tanay
-
Hi Tanay Biswas,
Based on this output from CreateSomaticPanelOfNormals, it looks like it might not have finished running. If you haven't received any error or output, is it possible that the job is still running? From the previous step, did you receive the output pon_db, or did the run stop before this was generated? If the run stopped early, I would recommend rerunning the previous step with more memory.
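One quick way to tell a finished run from a failed one is to look at the exit status and at the files on disk. The following is a minimal sketch, not part of GATK: `check_output` and `check_workspace` are hypothetical helper names, and the three workspace file names are the ones GenomicsDBImport reports in its log. Their presence is only a sanity check and does not prove that the import completed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not GATK commands.

# Succeeds if the given file exists and is non-empty.
check_output() {
    if [ -s "$1" ]; then
        echo "OK: $1"
    else
        echo "MISSING or empty: $1"
        return 1
    fi
}

# Succeeds if the workspace directory holds the metadata files that
# GenomicsDBImport reports writing (vidmap.json, callset.json, vcfheader.vcf).
check_workspace() {
    local ws="$1" f status=0
    for f in vidmap.json callset.json vcfheader.vcf; do
        check_output "$ws/$f" || status=1
    done
    return "$status"
}

# Typical use right after a run:
#   java -jar "$GATK_JAR" CreateSomaticPanelOfNormals ... -O pon.vcf.gz
#   echo "exit status: $?"   # non-zero means the tool failed
#   check_output pon.vcf.gz
```

If the exit status is non-zero or the output file is missing, the run did not complete, even when the log shows no explicit error line.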
Kind regards,
Pamela
-
Hi Pamela
I don't see any job running; the command finished almost immediately. For the previous step I ran the following command:
[tbiswas@un04 ~]$ java -jar /home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar GenomicsDBImport \
    -R /home/tbiswas/hg19.fa --genomicsdb-workspace-path /scratch/tbiswas/pon_db2 --batch-size 1 \
    -L /home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed \
    -V /home/tbiswas/gatk_output/P4-BD.vcf.gz -V /home/tbiswas/gatk_output/P5-BD.vcf.gz \
    -V /home/tbiswas/gatk_output/P6-BD.vcf.gz -V /home/tbiswas/gatk_output/P8-BD.vcf.gz \
    -V /home/tbiswas/gatk_output/P12-BD.vcf.gz -V /home/tbiswas/gatk_output/P13-BD.vcf.gz \
    -V /home/tbiswas/gatk_output/P14-BD.vcf.gz
This command generated the output, but with the following memory-related error, which I understand is acceptable when the data covers only one chromosome.
14:02:53.727 INFO IntervalArgumentCollection - Processing 60456963 bp from intervals
14:02:54.039 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
14:02:54.643 INFO GenomicsDBImport - Done initializing engine
14:02:56.849 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
14:02:57.095 INFO GenomicsDBImport - Vid Map JSON file will be written to /scratch/tbiswas/pon_db2/vidmap.json
14:02:57.095 INFO GenomicsDBImport - Callset Map JSON file will be written to /scratch/tbiswas/pon_db2/callset.json
14:02:57.095 INFO GenomicsDBImport - Complete VCF Header will be written to /scratch/tbiswas/pon_db2/vcfheader.vcf
14:02:57.095 INFO GenomicsDBImport - Importing to workspace - /scratch/tbiswas/pon_db2
17:23:06.907 INFO GenomicsDBImport - Importing batch 1 with 1 samples
.
.
19:14:30.815 INFO GenomicsDBImport - Importing batch 1 with 1 samples
terminate called after throwing an instance of 'GenomicsDBConfigException'
what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_3544340605349532363.json
Aborted (core dumped)
[tbiswas@un04 ~]$
But I don't know how to make the command run without this kind of error. Please suggest what I can do to generate pon_db without any error.
Thank you so much.
Regards,
Tanay
-
Hi Tanay Biswas,
Okay, thank you for providing this. It does look like the issue is in the GenomicsDBImport step, which isn't producing the output that you are expecting. This looks like a memory error involving the space being used as your temporary storage. Here is some information from our tool docs regarding temp space with GenomicsDBImport:
GenomicsDBImport uses temporary disk storage during import. The amount of temporary disk storage required can exceed the space available, especially when specifying a large number of intervals. The command line argument `--tmp-dir` can be used to specify an alternate temporary storage location with sufficient space.
We have a document regarding optimizing with GenomicsDB, you can view it here: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines
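As an illustration of the advice above, here is a minimal dry-run sketch that assembles a GenomicsDBImport command with `--tmp-dir` and with `--merge-input-intervals` (the argument named in the tool's warning about large interval counts). It only prints the command and does not run GATK. The function name `build_import_cmd`, the variable names, and all default paths are placeholders chosen for this example, not values from a real system.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build a GenomicsDBImport command line and print it.
# Every path below is a placeholder; override via environment variables.

GATK_JAR="${GATK_JAR:-gatk-package-4.2.5.0-local.jar}"  # assumed jar location
REFERENCE="${REFERENCE:-hg19.fa}"
WORKSPACE="${WORKSPACE:-pon_db}"            # should be a new or empty directory
TMP_DIR="${TMP_DIR:-/path/to/large/tmp}"    # a location with enough free space
INTERVALS="${INTERVALS:-Covered.bed}"

# Arguments: one or more normal-sample VCF paths.
build_import_cmd() {
    local cmd=(java -Xmx8g -jar "$GATK_JAR" GenomicsDBImport
               -R "$REFERENCE"
               --genomicsdb-workspace-path "$WORKSPACE"
               --tmp-dir "$TMP_DIR"
               --merge-input-intervals
               -L "$INTERVALS")
    local vcf
    for vcf in "$@"; do
        cmd+=(-V "$vcf")    # one -V per input VCF
    done
    printf '%s ' "${cmd[@]}"
    printf '\n'
}

# Print the command for two example inputs so it can be reviewed first.
build_import_cmd P4-BD.vcf.gz P5-BD.vcf.gz
```

Printing the command first makes it easy to confirm that the temporary directory and workspace point at the intended filesystem before starting a multi-hour import.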
Please let me know if this information is helpful for running GenomicsDBImport successfully.
Kind regards,
Pamela
-
Hi Pamela
Thank you for the information. I've checked the provided links and ran the following command:
[tbiswas@un04 ~]$ java -jar -Xmx8g /home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar \
    GenomicsDBImport -R /home/tbiswas/hg19.fa --genomicsdb-workspace-path new_pon_db \
    --tmp-dir /home/tbiswas --batch-size 1 -L /home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed \
    -V /home/tbiswas/gatk_output/P4-BD.vcf.gz -V /home/tbiswas/gatk_output/P5-BD.vcf.gz \
    -V /home/tbiswas/gatk_output/P6-BD.vcf.gz -V /home/tbiswas/gatk_output/P8-BD.vcf.gz \
    -V /home/tbiswas/gatk_output/P12-BD.vcf.gz -V /home/tbiswas/gatk_output/P13-BD.vcf.gz \
    -V /home/tbiswas/gatk_output/P14-BD.vcf.gz
15:53:57.761 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tbiswas/softwares/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Mar 30, 2022 3:53:59 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:53:59.900 INFO GenomicsDBImport - ------------------------------------------------------------
15:53:59.900 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.5.0
15:53:59.900 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
15:53:59.997 INFO GenomicsDBImport - Executing as tbiswas@un04 on Linux v3.10.0-327.el7.x86_64 amd64
15:53:59.998 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_65-b17
15:53:59.998 INFO GenomicsDBImport - Start Date/Time: 30 March, 2022 3:53:57 PM IST
15:53:59.998 INFO GenomicsDBImport - ------------------------------------------------------------
15:53:59.998 INFO GenomicsDBImport - ------------------------------------------------------------
15:53:59.998 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
15:53:59.999 INFO GenomicsDBImport - Picard Version: 2.25.4
15:53:59.999 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
15:53:59.999 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:53:59.999 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:53:59.999 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:53:59.999 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:53:59.999 INFO GenomicsDBImport - Deflater: IntelDeflater
15:53:59.999 INFO GenomicsDBImport - Inflater: IntelInflater
15:53:59.999 INFO GenomicsDBImport - GCS max retries/reopens: 20
15:53:59.999 INFO GenomicsDBImport - Requester pays: disabled
15:53:59.999 INFO GenomicsDBImport - Initializing engine
15:54:08.056 INFO FeatureManager - Using codec BEDCodec to read file file:///home/tbiswas/SureSelectV6_S07604514_hs_hg19/Covered.bed
15:54:29.938 INFO IntervalArgumentCollection - Processing 60456963 bp from intervals
15:54:30.626 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
15:54:31.142 INFO GenomicsDBImport - Done initializing engine
15:54:36.712 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
15:54:36.811 INFO GenomicsDBImport - Vid Map JSON file will be written to /home/tbiswas/new_pon_db/vidmap.json
15:54:36.811 INFO GenomicsDBImport - Callset Map JSON file will be written to /home/tbiswas/new_pon_db/callset.json
15:54:36.811 INFO GenomicsDBImport - Complete VCF Header will be written to /home/tbiswas/new_pon_db/vcfheader.vcf
15:54:36.812 INFO GenomicsDBImport - Importing to workspace - /home/tbiswas/new_pon_db
18:33:24.251 INFO GenomicsDBImport - Importing batch 1 with 1 samples
.
.
21:45:13.771 INFO GenomicsDBImport - Importing batch 1 with 1 samples
[TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error path=new_pon_db/chr1$1646244$1646363/genomicsdb_meta_dir/genomicsdb_column_bounds.json errno=122(Disk quota exceeded)
21:45:35.201 erro NativeGenomicsDB - pid=40987 tid=25560 VariantStorageManagerException exception : Could not write to column bounds file
TileDB error message : [TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error path=new_pon_db/chr1$1646244$1646363/genomicsdb_meta_dir/genomicsdb_column_bounds.json errno=122(Disk quota exceeded)
terminate called after throwing an instance of 'VariantStorageManagerException'
what(): VariantStorageManagerException exception : Could not write to column bounds file
TileDB error message : [TileDB::FileSystem] Error: (write_to_file) Cannot write to file; File writing error path=new_pon_db/chr1$1646244$1646363/genomicsdb_meta_dir/genomicsdb_column_bounds.json errno=122(Disk quota exceeded)
Aborted (core dumped)
[tbiswas@un04 ~]$
This generated the above error, which is different from the previous one. My disk (HPC) has 223 TB of space and it shows that the disk is not full yet. So, how can I solve this issue?
Please let me know what can be done.
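A note on the log above: errno=122 is the Linux "Disk quota exceeded" code (EDQUOT), which refers to a per-user or per-group limit and can be reached while the filesystem as a whole still has free space. The path in the error also suggests that the workspace gets a directory per interval, so a file-count quota may be worth checking as well as a size quota; that is an assumption about this cluster, not something the log confirms. A minimal sketch for looking at both, where `avail_kb` and `count_files` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the available space, in KiB, on the filesystem that holds "$1".
# Uses the POSIX output format of df (second line, fourth column).
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print how many regular files exist under directory "$1".
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example: inspect the current directory.
echo "available KiB: $(avail_kb .)"
echo "regular files: $(count_files .)"
```

These numbers describe the filesystem and the workspace, not the quota itself. On many clusters `quota -s` or a site-specific equivalent reports the actual limits; whether such a command is available on this system is an assumption.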
-
Hi Tanay Biswas,
Thank you for taking a look at those documents and trying this out. Could you please try increasing the specified memory for the GenomicsDBImport job using the -Xmx option? Given that you have a lot more available space on your disk, you can allocate quite a bit more memory to the job. Please let me know if this is successful.
Kind regards,
Pamela