GenomicsDBImport only going through the first chromosome when given a list
Hi there,
I am using GATK v4.0.11.0 and having an issue with GenomicsDBImport. I've tried giving it intervals in the actual command and as a list so that I can make a database for my entire genome, but it won't go beyond chromosome 1.
Here is the first of two commands I've tried:
gatk --java-options "-Xmx72g" GenomicsDBImport \
--sample-name-map ${Name}_sample_map \
--genomicsdb-workspace-path /cluster/home/cgoeckeritz/bwa_full/moms/gatk/${Name}_DB/${Name}_database \
--intervals /cluster/home/cgoeckeritz/bwa_full/moms/gatk/intervals.list \
--reader-threads 8
The intervals file looks is just one chromosome per line:
chr1A
chr2A
chr3A
chr4A
chr5A
chr6A
chr7A
chr8A
chr9A
chr10A
chr11A
chr12A
chr13A
chr14A
chr15A
chr16A
chr17A
Here's the second command I've tried, but I get the same problem:
gatk --java-options "-Xmx284g" GenomicsDBImport \
--sample-name-map ${Name}_sample_map \
--genomicsdb-workspace-path /cluster/home/cgoeckeritz/bwa_full/moms/gatk/${Name}_DB/${Name}_database_B \
-L chr1A \
-L chr2A \
-L chr3A \
-L chr4A \
-L chr5A \
-L chr6A \
-L chr7A \
-L chr8A \
-L chr9A \
-L chr10A \
-L chr11A \
-L chr12A \
-L chr13A \
-L chr14A \
-L chr15A \
-L chr16A \
-L chr17A \
--reader-threads 24
The program log outputs the following:
/cluster/software/gatk-4.0.11.0/gatk:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
if len(args) is 0 or (len(args) is 1 and (args[0] == "--help" or args[0] == "-h")):
/cluster/software/gatk-4.0.11.0/gatk:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
if len(args) is 0 or (len(args) is 1 and (args[0] == "--help" or args[0] == "-h")):
/cluster/software/gatk-4.0.11.0/gatk:117: SyntaxWarning: "is" with a literal. Did you mean "=="?
if len(args) is 1 and args[0] == "--list":
/cluster/software/gatk-4.0.11.0/gatk:301: SyntaxWarning: "is" with a literal. Did you mean "=="?
if call(["gsutil", "-q", "stat", gcsjar]) is 0:
/cluster/software/gatk-4.0.11.0/gatk:305: SyntaxWarning: "is" with a literal. Did you mean "=="?
if call(["gsutil", "cp", jar, gcsjar]) is 0:
/cluster/software/gatk-4.0.11.0/gatk:458: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if not len(properties) is 0:
/cluster/software/gatk-4.0.11.0/gatk:462: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if not len(filesToAdd) is 0:
Using GATK jar /cluster/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx284g -jar /cluster/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar GenomicsDBImport --sample-name-map Mspectabilis_588917_sample_map --genomicsdb-workspace-path /cluster/home/cgoeckeritz/bwa_full/moms/gatk/Mspectabilis_588917_DB/Mspectabilis_588917_database_B -L chr1A -L chr2A -L chr3A -L chr4A -L chr5A -L chr6A -L chr7A -L chr8A -L chr9A -L chr10A -L chr11A -L chr12A -L chr13A -L chr14A -L chr15A -L chr16A -L chr17A --reader-threads 24
10:39:32.368 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/cluster/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:39:34.002 INFO GenomicsDBImport - ------------------------------------------------------------
10:39:34.002 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.11.0
10:39:34.002 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
10:39:34.002 INFO GenomicsDBImport - Executing as cgoeckeritz@hpc0006 on Linux v4.18.0-477.21.1.el8_8.x86_64 amd64
10:39:34.002 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_332-b09
10:39:34.002 INFO GenomicsDBImport - Start Date/Time: August 12, 2024 10:39:32 AM CDT
10:39:34.002 INFO GenomicsDBImport - ------------------------------------------------------------
10:39:34.003 INFO GenomicsDBImport - ------------------------------------------------------------
10:39:34.003 INFO GenomicsDBImport - HTSJDK Version: 2.16.1
10:39:34.003 INFO GenomicsDBImport - Picard Version: 2.18.13
10:39:34.003 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
10:39:34.003 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:39:34.003 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:39:34.004 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:39:34.004 INFO GenomicsDBImport - Deflater: IntelDeflater
10:39:34.004 INFO GenomicsDBImport - Inflater: IntelInflater
10:39:34.004 INFO GenomicsDBImport - GCS max retries/reopens: 20
10:39:34.004 INFO GenomicsDBImport - Requester pays: disabled
10:39:34.004 INFO GenomicsDBImport - Initializing engine
10:39:34.376 INFO IntervalArgumentCollection - Processing 643565055 bp from intervals
10:39:34.378 INFO GenomicsDBImport - Done initializing engine
10:39:34.480 INFO GenomicsDBImport - Vid Map JSON file will be written to /cluster/home/cgoeckeritz/bwa_full/moms/gatk/Mspectabilis_588917_DB/Mspectabilis_588917_database_B/vidmap.json
10:39:34.480 INFO GenomicsDBImport - Callset Map JSON file will be written to /cluster/home/cgoeckeritz/bwa_full/moms/gatk/Mspectabilis_588917_DB/Mspectabilis_588917_database_B/callset.json
10:39:34.480 INFO GenomicsDBImport - Complete VCF Header will be written to /cluster/home/cgoeckeritz/bwa_full/moms/gatk/Mspectabilis_588917_DB/Mspectabilis_588917_database_B/vcfheader.vcf
10:39:34.481 INFO GenomicsDBImport - Importing to array - /cluster/home/cgoeckeritz/bwa_full/moms/gatk/Mspectabilis_588917_DB/Mspectabilis_588917_database_B/genomicsdb_array
10:39:34.482 INFO ProgressMeter - Starting traversal
10:39:34.482 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
10:39:34.575 INFO GenomicsDBImport - Starting batch input file preload
10:39:35.146 INFO GenomicsDBImport - Finished batch preload
10:39:35.146 INFO GenomicsDBImport - Importing batch 1 with 34 samples
14:52:20.481 INFO GenomicsDBImport - Starting batch input file preload
14:52:20.942 INFO GenomicsDBImport - Finished batch preload
14:52:20.943 INFO GenomicsDBImport - Importing batch 1 with 34 samples
14:52:20.946 INFO GenomicsDBImport - Starting batch input file preload
14:52:20.946 INFO GenomicsDBImport - Shutting down engine
[August 12, 2024 2:52:20 PM CDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 252.81 minutes.
Runtime.totalMemory()=1663041536
org.broadinstitute.hellbender.exceptions.GATKException: Cannot call query with different interval, expected:chr1A:1-33734251 queried with: chr2A:1-35818875
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$InitializedQueryWrapper.query(GenomicsDBImport.java:816)
at com.intel.genomicsdb.importer.GenomicsDBImporter.<init>(GenomicsDBImporter.java:136)
at com.intel.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:563)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
It seems the concerning part of the log is 'Cannot call query with different interval', but I'm unsure what this means, or how to fix it... Any help would be very appreciated!
Thanks so much for your time,
Charity
-
The version you are using is quite old and unsupported. Can you try using GATK 4.6.0.0 as it is the most recent and with many fixes and enhancements?
A couple of things to note: We recommend importing a single interval per GenomicsDB import instance to make things less convoluted and faster. You may be able to import each contig to a separate instance and genotype them separately to save time. If the default python is python2 then it may be possible to observe such issues. We recommend using python 3.6 and later.
I hope this helps.
-
Hi Gökalp Çelik,
Thanks so much for your reply. I can ask our admin to install the latest version but I've already run HaplotypeCaller with the present version, which took me several days... so I hesitate to switch, and I was hoping to get DBImport to work first with the current version. If I'm still beating my head against a wall with a few more tries I guess I'll have to upgrade!
I don't have an issue with running each chromosome (interval) separately, but I'd have to create a new database path each time, no? Could I combine all of my chromosome databases afterward?
Thanks again for your help!
Kindly,Charity
-
Hi again.
You need to create separate databases for each contig that is for sure. Also you need to genotype them separately to create one VCF file per contig. Once you get those VCF files you can combine them together using GatherVcfs tool to have one single call file for all contigs and samples.
I hope this helps.
Please sign in to leave a comment.
3 comments