Error in running GenomicsDBConfigException
Hi Guys,
I am running GATK for variant calling on BAM files for two samples (as an example). I first created GVCFs using the following commands:
java -jar $GenomeAnalysisTK HaplotypeCaller -R hg19-agilent.fasta -I aln_501_trimmed_again_RG.bam -O aln_501_trimmed_again_RG.g.vcf -ERC GVCF
java -jar $GenomeAnalysisTK HaplotypeCaller -R hg19-agilent.fasta -I aln_502_trimmed_again_RG.bam -O aln_502_trimmed_again_RG.g.vcf -ERC GVCF
Then, I am trying to create a database (on an external drive with 28 TB of space) as below:
java -jar $GenomeAnalysisTK GenomicsDBImport -V aln_501_trimmed_again_RG.g.vcf -V aln_502_trimmed_again_RG.g.vcf --genomicsdb-workspace-path my_database --intervals chr1.bed
But I am getting this error:
terminate called after throwing an instance of 'GenomicsDBConfigException'
what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_11705777375230343382.json
Aborted (core dumped)
Also, my system shows a pop-up saying "low disk space in Filesystem root".
Is this error due to disk space or something else? How much disk space do I need to run this command? I am running these analyses on additional storage drives with 28 TB of space. Please suggest.
-
Hi Shivangi Agarwal,
Could you share your complete program log and also the GATK version number?
Thanks,
Genevieve
-
Hi
GATK version is 4.2.2.0.
Below is the command and its log.
java -jar $GenomeAnalysisTK GenomicsDBImport -V aln_501_trimmed_again_RG.g.vcf -V aln_502_trimmed_again_RG.g.vcf --genomicsdb-workspace-path my_database --intervals chr1.bed
16:52:22.602 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/shivangiagarwal/Downloads/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Dec 18, 2021 4:52:22 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:52:22.735 INFO GenomicsDBImport - ------------------------------------------------------------
16:52:22.736 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.2.0
16:52:22.736 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
16:52:22.736 INFO GenomicsDBImport - Executing as shivangiagarwal@gayatry-PowerEdge-T640 on Linux v5.11.0-40-generic amd64
16:52:22.736 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v11.0.11+9-Ubuntu-0ubuntu2
16:52:22.736 INFO GenomicsDBImport - Start Date/Time: December 18, 2021 at 4:52:22 PM CST
16:52:22.736 INFO GenomicsDBImport - ------------------------------------------------------------
16:52:22.736 INFO GenomicsDBImport - ------------------------------------------------------------
16:52:22.737 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
16:52:22.737 INFO GenomicsDBImport - Picard Version: 2.25.4
16:52:22.737 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
16:52:22.737 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:52:22.737 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:52:22.737 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:52:22.738 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:52:22.738 INFO GenomicsDBImport - Deflater: IntelDeflater
16:52:22.738 INFO GenomicsDBImport - Inflater: IntelInflater
16:52:22.738 INFO GenomicsDBImport - GCS max retries/reopens: 20
16:52:22.738 INFO GenomicsDBImport - Requester pays: disabled
16:52:22.738 INFO GenomicsDBImport - Initializing engine
16:52:23.474 INFO FeatureManager - Using codec BEDCodec to read file file:///media/shivangiagarwal/DATA/PrCa/S31285117_Covered.bed
16:52:25.738 INFO IntervalArgumentCollection - Processing 49475726 bp from intervals
16:52:25.794 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
16:52:25.796 INFO GenomicsDBImport - Done initializing engine
16:52:26.128 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.1-d59e886
16:52:26.145 INFO GenomicsDBImport - Vid Map JSON file will be written to /home/shivangiagarwal/my_database-2/vidmap.json
16:52:26.146 INFO GenomicsDBImport - Callset Map JSON file will be written to /home/shivangiagarwal/my_database-2/callset.json
16:52:26.146 INFO GenomicsDBImport - Complete VCF Header will be written to /home/shivangiagarwal/my_database-2/vcfheader.vcf
16:52:26.146 INFO GenomicsDBImport - Importing to workspace - /home/shivangiagarwal/my_database-2
17:19:32.297 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:34.502 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:36.360 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:38.525 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:40.599 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:42.947 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:44.995 INFO GenomicsDBImport - Importing batch 1 with 2 samples
17:19:46.992 INFO GenomicsDBImport - Importing batch 1 with 2 samples
[...]
18:42:07.581 INFO GenomicsDBImport - Importing batch 1 with 2 samples
18:42:09.419 INFO GenomicsDBImport - Importing batch 1 with 2 samples
terminate called after throwing an instance of 'GenomicsDBConfigException'
what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_5890342525445890988.json
Aborted (core dumped)
-
Hi Shivangi Agarwal,
Thank you! I updated your comment just for readability but it is good to see the whole log!
It's challenging to determine what could be causing the issue with the /tmp/loader_5890342525445890988.json file. Most likely there is a mix-up with your sample names (see this similar issue ticket). Alternatively, your temp directory may be running out of space, or you may be hitting the limit on the number of file handles that can be open at once. You can explicitly specify a temporary directory with the option --tmp-dir. (See this article for more information about GenomicsDB usage.)
One other thought is that this could be a strange issue arising from how you are running the command. With GATK4, we recommend that you use the GATK wrapper script when submitting commands: https://gatk.broadinstitute.org/hc/en-us/articles/360035531892-GATK4-command-line-syntax.
Could you see what is in the /tmp/loader_5890342525445890988.json file for more information?
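One way to inspect the loader file is to check whether it parses as JSON at all, since a file that is empty or cut off (for example when the disk fills up) would fail in this way. The following is a minimal Python sketch; the function name check_json_file is hypothetical and not part of GATK or GenomicsDB, and the path is simply the one from the error message above.

```python
import json
from pathlib import Path


def check_json_file(path):
    """Return (ok, detail) describing whether `path` parses as JSON.

    Hypothetical helper for illustration; it only reports what the
    Python json module sees and knows nothing about GenomicsDB itself.
    """
    p = Path(path)
    if not p.is_file():
        return False, "file not found"
    size = p.stat().st_size
    if size == 0:
        return False, "file is empty (0 bytes)"
    try:
        with p.open() as handle:
            json.load(handle)
    except json.JSONDecodeError as err:
        # lineno/colno point at the place where parsing stopped.
        return False, f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    return True, f"valid JSON, {size} bytes"


if __name__ == "__main__":
    # Path copied from the error message in this thread.
    print(check_json_file("/tmp/loader_5890342525445890988.json"))
```

If the file turns out to be empty or to stop partway through, that would be consistent with the temp directory running out of space, although it does not prove it.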
Best,
Genevieve
-
Hi,
Yes, I can see many temporary JSON files created in the tmp directory (total size: 250 GB).
I observed that once the disk space is fully used, I get this error (as mentioned above) along with a pop-up saying "No disk space".
Now I am running the command again, specifying a temp directory on the storage drive, which has 12 TB of space remaining.
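Before a long run, it can help to confirm how much free space the chosen temp directory really has. Below is a minimal Python sketch using the standard library; the helper names free_space_gib and has_room are hypothetical, and the paths checked are only examples.

```python
import shutil
import tempfile


def free_space_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_gib):
    """True if the filesystem holding `path` has at least `needed_gib` GiB free."""
    return free_space_gib(path) >= needed_gib


if __name__ == "__main__":
    # Compare the system temp directory (often on the root filesystem)
    # with the directory the command is run from.
    for label, path in (("system temp dir", tempfile.gettempdir()),
                        ("current directory", ".")):
        print(f"{label}: {free_space_gib(path):.1f} GiB free at {path}")
```

Running this from the storage drive shows whether the default temp location and the intended --tmp-dir location sit on filesystems with very different amounts of free space.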
-
Okay, those are great observations! Let me know if it is successful.
-
Hi Genevieve,
I started GenomicsDBImport on Monday, but my system (125.8 GiB) froze and I finally had to restart it.
Then I started it again on another system (256 GiB), and it is still running (more than 24 hrs). I also noticed that around 2.2 TB of temporary files have been generated so far (command below).
java -jar $GenomeAnalysisTK GenomicsDBImport -V aln_501_trimmed_again_RG.g.vcf -V aln_502_trimmed_again_RG.g.vcf --genomicsdb-workspace-path my_database --intervals S31285117_Covered.bed --tmp-dir ./temp
So I am wondering how much time and space it requires. I have to call variants for 48 samples.
-
Yes, speeding up GenomicsDBImport is a common question. We have an article containing our recommendations here: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GenomicsDBImport-usage-and-performance-guidelines
I also recommend that you submit your GATK command line with the gatk wrapper script: https://gatk.broadinstitute.org/hc/en-us/articles/360035531892-GATK4-command-line-syntax
Please let me know if you have further questions.
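As a rough illustration of the two suggestions above, the command from this thread could be assembled for the gatk wrapper script as in the Python sketch below. The function name build_import_command and the 4g Java heap are assumptions made for the example. The --merge-input-intervals flag is the argument named in the warning in the log earlier in this thread; whether it is appropriate depends on the data, so please check the performance guidelines article before relying on it.

```python
import shlex


def build_import_command(gvcfs, workspace, intervals, tmp_dir, java_mem="4g"):
    """Assemble a GenomicsDBImport call for the gatk wrapper script.

    Hypothetical helper: it only builds the argument list and does not
    run GATK. The default heap size (4g) is an arbitrary example value.
    """
    cmd = ["gatk", "--java-options", f"-Xmx{java_mem}", "GenomicsDBImport"]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += [
        "--genomicsdb-workspace-path", workspace,
        "--intervals", intervals,
        # Named in the log warning about using many intervals.
        "--merge-input-intervals",
        "--tmp-dir", tmp_dir,
    ]
    return cmd


if __name__ == "__main__":
    command = build_import_command(
        ["aln_501_trimmed_again_RG.g.vcf", "aln_502_trimmed_again_RG.g.vcf"],
        workspace="my_database",
        intervals="S31285117_Covered.bed",
        tmp_dir="./temp",
    )
    # Print a shell-quoted version that can be reviewed before running it.
    print(shlex.join(command))
```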
-
Hi Genevieve,
I want to call variants for my samples, which are divided into four categories: normal, stage I, stage II and stage III.
My understanding so far, after going through the documentation, is that I need to create a database for the normal samples (PON) (https://gatk.broadinstitute.org/hc/en-us/articles/360037058172-CreateSomaticPanelOfNormals-BETA-).
How should I proceed to call variants comparing my normal samples to the different tumor stage samples?
-
Hi Genevieve,
Can you please respond to the above query?
Thanks
-
Hi Shivangi Agarwal,
If all your samples are from the same individual, you can do multi-sample calling with Mutect2. This will help produce higher-quality results. There is a forum post discussion that goes into detail about multi-sample calling, which you can read for more information. You can then use the final VCF and the read depth for each sample to determine whether any variants are part of tumor progression.
If you have 40 normal samples, you can create your own panel of normals to be used with Mutect2. If not, you can use our publicly available panel of normals created from the 1000 genomes project data. We recommend that you use a panel of normals with the --panel-of-normals Mutect2 argument and a corresponding normal sample for each tumor sample with the -normal argument for best results.
- Panel of Normals and our publicly available option: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-
- How to create a Panel of Normals: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2
Please let me know if you have further questions regarding this query.
Best,
Genevieve
-
Hi Genevieve,
I have a few questions regarding my GATK run setup. They are as follows:
1. I have 12 normal samples. Can I make my own PON from those twelve? Is that allowed or recommended?
2. I tried to create my own PON from the 12 samples using the commands below:
!gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz
!gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal2.vcf.gz
!gatk GenomicsDBImport -R reference.fasta -L hg19.bed --genomicsdb-workspace-path pon_db -V normal1.vcf.gz -V normal2.vcf.gz
but I am getting an error like:
"A USER ERROR has occurred: Duplicate sample: $i. Sample was found in both file://normal.vcf.gz and normal2.vcf.gz"
How can I fix this?
3. To run Mutect2 (command below), I will also need --germline-resource af-only-gnomad.vcf.gz. How can I get this for hg19?
gatk Mutect2 \
  -R reference.fa \
  -I tumor.bam \
  -I normal.bam \
  -normal normal_sample_name \
  --germline-resource af-only-gnomad.vcf.gz \
  --panel-of-normals pon.vcf.gz \
  -O somatic.vcf.gz
4. I am curious whether the following two approaches give different results: (a) calling somatic variants separately for each sample, whether normal or tumor, using Mutect2 tumor-only mode, and then comparing the resulting VCF files with my own Python script to find the overlap between tumor and normal; or (b) running Mutect2 in tumor-normal mode (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2) to get the variants present in the tumor, with those present in the normal already excluded. Does this make any difference to the results?
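For reference, the overlap comparison described in question 4 might look roughly like the Python sketch below. This is only an illustration of the idea: the function names are hypothetical, variants are matched purely on chromosome, position, REF and ALT, and the FILTER column and genotype fields are ignored. A plain set subtraction like this is not the same as the statistical model Mutect2 applies in tumor-normal mode.

```python
import gzip


def variant_keys(vcf_path):
    """Collect (chrom, pos, ref, alt) keys from a plain or gzipped VCF.

    Hypothetical helper; multi-allelic records are split so that each
    ALT allele gets its own key.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    keys = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
            for alt in alts.split(","):
                keys.add((chrom, int(pos), ref, alt))
    return keys


def tumor_only_variants(tumor_vcf, normal_vcf):
    """Variants seen in the tumor VCF but absent from the normal VCF."""
    return variant_keys(tumor_vcf) - variant_keys(normal_vcf)
```

Calling tumor_only_variants("tumor.vcf.gz", "normal.vcf.gz") with two tumor-only call sets returns the set of keys unique to the tumor file; the file names here are placeholders.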
Please spare some time to answer my queries. I would really appreciate it.
Thanks,
Shivangi
-
Hi Shivangi,
- No, it's not recommended to make a PON with only 12 samples. We recommend a minimum of 40: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-. If not, we recommend you use our public PON.
- If you are getting this error, you will need to give each of your samples a different name. Check out the tool RenameSampleInVcf.
- All our resources available are outlined in the resource bundle page: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
- Tumor-normal mode is much better than tumor-only mode. I would recommend tumor-normal mode for your main analysis, but you can always check out tumor-only if you are curious. Tumor-only will have many false positives.
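For the duplicate-sample error in the second point above, a quick way to see which input files share a sample name is to read the #CHROM header line of each VCF. The Python sketch below does this with hypothetical helper names; it assumes standard VCF headers, where sample names are the columns after FORMAT, and it only reports duplicates without renaming anything.

```python
import gzip
from collections import defaultdict


def sample_names(vcf_path):
    """Return the sample names from the #CHROM header line of a VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed fields; samples start at column 10.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without finding a header line
    return []


def duplicated_samples(vcf_paths):
    """Map each sample name found in more than one file to those files."""
    seen = defaultdict(list)
    for path in vcf_paths:
        for name in sample_names(path):
            seen[name].append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

For example, duplicated_samples(["normal1.vcf.gz", "normal2.vcf.gz"]) returns an empty dictionary when every file carries a distinct sample name.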
Best,
Genevieve