Missing file: callset.json when creating PON
In the second step of creating a PON from 50 normal Mutect2 calls, the processing goes well and no error is reported, but one file is missing from the output directory: callset.json.
a) GATK version used: 4.2.2.0
b) Exact command used:
gatk GenomicsDBImport -R ${Reference_genome} \
-L ${CaptureKitFile} --genomicsdb-workspace-path /path/to/PON_db \
-V ${Output_directory}/sample0.vcf.gz -V ${Output_directory}/sample1.vcf.gz \
-V ${Output_directory}/sample2.vcf.gz -V ${Output_directory}/sample3.vcf.gz \
-V ${Output_directory}/sample4.vcf.gz -V ..... etc etc \
-V ${Output_directory}/sample50.vcf.gz
c) Entire output log: (there is no error reported)
11:59:56.072 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user/miniconda3/envs/GATKSomatic/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 22, 2021 11:59:56 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
11:59:56.599 INFO GenomicsDBImport - ------------------------------------------------------------
11:59:56.599 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.2.0
11:59:56.599 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
11:59:56.603 INFO GenomicsDBImport - Executing as user@hpc on Linux v3.10.0-1160.36.2.el7.x86_64 amd64
11:59:56.603 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
11:59:56.603 INFO GenomicsDBImport - Start Date/Time: October 22, 2021 11:59:55 AM CEST
11:59:56.603 INFO GenomicsDBImport - ------------------------------------------------------------
11:59:56.603 INFO GenomicsDBImport - ------------------------------------------------------------
11:59:56.604 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
11:59:56.604 INFO GenomicsDBImport - Picard Version: 2.25.4
11:59:56.604 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
11:59:56.604 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:59:56.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:59:56.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:59:56.604 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:59:56.604 INFO GenomicsDBImport - Deflater: IntelDeflater
11:59:56.604 INFO GenomicsDBImport - Inflater: IntelInflater
11:59:56.604 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:59:56.605 INFO GenomicsDBImport - Requester pays: disabled
11:59:56.605 INFO GenomicsDBImport - Initializing engine
12:00:01.375 INFO FeatureManager - Using codec BEDCodec to read file file:///gpfs/project/projects/spike/AG_Borkhardt/References/Covered_region.Hg38_V7.bed
12:00:02.314 INFO IntervalArgumentCollection - Processing 49668806 bp from intervals
12:00:02.382 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
12:00:02.795 INFO GenomicsDBImport - Done initializing engine
12:00:03.239 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.1-d59e886
12:00:03.242 INFO GenomicsDBImport - Vid Map JSON file will be written to /gpfs/project/yasinl/PON_db/vidmap.json
12:00:03.242 INFO GenomicsDBImport - Callset Map JSON file will be written to /gpfs/project/yasinl/PON_db/callset.json
12:00:03.242 INFO GenomicsDBImport - Complete VCF Header will be written to /gpfs/project/yasinl/PON_db/vcfheader.vcf
12:00:03.242 INFO GenomicsDBImport - Importing to workspace - /gpfs/project/yasinl/PON_db
12:02:39.541 INFO GenomicsDBImport - Importing batch 1 with 50 samples
12:02:46.690 INFO GenomicsDBImport - Importing batch 1 with 50 samples
12:02:51.923 INFO GenomicsDBImport - Importing batch 1 with 50 samples
What could be the reason for not getting this callset.json file?
By the way, the output PON database directory contains only the following files:
1$12146$12310 1$12596$12760 1$13416$13667 __tiledb_workspace.tdb vcfheader.vcf vidmap.json
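A quick shell check (a sketch; the workspace path is a placeholder) to list which of the expected metadata files are absent from the workspace:

```shell
# Sketch: check a GenomicsDB workspace for the metadata files that
# GenomicsDBImport reports it will write; a clean run produces all of them.
check_gdb_workspace() {
    ws="$1"
    status=0
    for f in callset.json vidmap.json vcfheader.vcf __tiledb_workspace.tdb; do
        if [ ! -e "$ws/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}

# Example (workspace path is a placeholder):
# check_gdb_workspace /path/to/PON_db
```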
-
Hi Lait,
The program log you are sharing does not look complete to me. If this is where it ends, the process was likely killed prematurely, possibly by your machine, which would explain why you are not getting all of the output files. Try giving the job more memory or more storage space so it can run to completion.
We have a GenomicsDBImport performance guide here: GenomicsDBImport usage and performance guidelines.
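As a sketch only (the 48g heap and the scratch/workspace paths are placeholders to size to your cluster; the shell variables are the ones from your original command), the rerun could look like:

```shell
# Sketch: rerun the import with an explicit Java heap and a scratch directory
# on a filesystem with plenty of free space. 48g and the paths are
# placeholders -- size them to your node and data.
mkdir -p /path/to/scratch/gdb_tmp

gatk --java-options "-Xmx48g" GenomicsDBImport \
    -R "${Reference_genome}" \
    -L "${CaptureKitFile}" \
    --merge-input-intervals \
    --tmp-dir /path/to/scratch/gdb_tmp \
    --genomicsdb-workspace-path /path/to/PON_db \
    -V "${Output_directory}/sample0.vcf.gz" \
    -V "${Output_directory}/sample1.vcf.gz"
```

The --merge-input-intervals flag addresses the warning in your log about using more than 100 intervals in a single import.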
Hope this solves your issue!
Best,
Genevieve
-
Thank you for your reply.
Yes, you are right, the process was aborted.
I gave it more resources and ran the command on each chromosome separately: as you can see in the -L option, I am using the Agilent V7 exome capture kit, so I split that file into one file per chromosome and ran the command 24 times in parallel.
My question is: how can I reassemble the output, which is now spread across 24 different workspaces, to be able to use it in the next step (gatk CreateSomaticPanelOfNormals)?
-
Hi Lait,
There isn't a good method to combine GenomicsDB workspaces before CreateSomaticPanelOfNormals; it can only accept one GenomicsDB workspace. In our production pipelines, VCFs with different intervals are merged after GenotypeGVCFs, which is not a step you will run when creating your PON.
A better option would be either to add samples incrementally to your GenomicsDB workspace or to decrease the batch size. I would recommend keeping all your intervals in the same command.
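As a rough sketch (the batch size of 10, the paths, and the sample map file names are placeholders):

```shell
# Sketch: import all intervals in one command, in batches of 10 samples,
# using a tab-separated sample map instead of 50 separate -V arguments.
# sample_map.tsv format: sample_name<TAB>/path/to/sample.vcf.gz
gatk GenomicsDBImport \
    -R "${Reference_genome}" \
    -L "${CaptureKitFile}" \
    --merge-input-intervals \
    --batch-size 10 \
    --sample-name-map sample_map.tsv \
    --genomicsdb-workspace-path /path/to/PON_db

# Sketch: add further normals to the same workspace later. With
# --genomicsdb-update-workspace-path, intervals are read from the
# existing workspace, so -L is not given again.
gatk GenomicsDBImport \
    --sample-name-map new_samples.tsv \
    --genomicsdb-update-workspace-path /path/to/PON_db
```

--batch-size limits how many sample readers are held open at once, which lowers memory pressure at some cost in runtime.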
Best,
Genevieve
-
I'm also missing the callset.json from my database. I submitted this job to the HPC:
#!/bin/bash -l
#PBS -N
#PBS -l nodes=1:ppn=32,mem=128gb
#PBS -l walltime=0048:00:00
conda activate gatk_env
gatk GenomicsDBImport -V 17 samples.g.vcf.gz --genomicsdb-workspace-path -my_cohort --intervals hg38.bed (used by the Ion Torrent). Should I do something different? I am doing whole-exome sequencing.
-
You may need to set up a temporary folder accessible to GATK; you may refer to the document below.
If you still observe issues, please also include your logs so that we can properly diagnose the problem.
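For example (a sketch; the scratch path and heap size are placeholders, and on PBS clusters a per-job scratch variable such as $TMPDIR may already be available):

```shell
# Sketch: give GATK a temporary directory it can actually write to, instead
# of the default system /tmp, which is often small on HPC compute nodes.
mkdir -p /scratch/$USER/gatk_tmp

gatk --java-options "-Xmx100g" GenomicsDBImport \
    --tmp-dir /scratch/$USER/gatk_tmp \
    --sample-name-map sample_map.tsv \
    --intervals hg38.bed \
    --genomicsdb-workspace-path my_cohort
```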
Regards.