Detect other jobs writing to GenomicsDB with GenomicsDBImport
If you are seeing an error, please provide (REQUIRED):
a) GATK version used: 4.1.8.1
b) Exact command used:
c) Entire error log:
If not an error, choose a category for your question (REQUIRED):
a) How do I (......)? How do I find out if there is another job trying to read from or write to the GenomicsDB repo?
b) What does (......) mean?
c) Why do I see (......)?
d) Where do I find (......)?
e) Will (......) be in future releases?
I have a central repo that stores a GenomicsDB workspace. When I process a batch of NGS samples, the pipeline writes their GVCF records into this central workspace with GenomicsDBImport. So when multiple batches of NGS samples are processed simultaneously, they write to the workspace at the same time. I understand this is allowed by TileDB, which GenomicsDB is based on, because each update operation creates a new so-called fragment.
Here is the case: I sometimes have to stop a batch job in the middle of GenomicsDBImport. When I do, the last update operation does not complete, leaving behind a callset.json.inc.backup file and a callset.json.fragmentlist file and marking the whole workspace as corrupted. According to this tutorial, https://gatk.broadinstitute.org/hc/en-us/articles/360035891051-GenomicsDB, I can restore the corrupted workspace by overwriting callset.json with callset.json.inc.backup and removing all the sub-folders not listed in callset.json.fragmentlist. This sounds fine when I know no other jobs are accessing the workspace.
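My understanding of that restore procedure, as a shell sketch. The mock workspace here is just so the steps are runnable end to end; on a real workspace you would point WS at the affected interval directory instead. I am assuming the fragment list holds one valid fragment directory name per line; please verify that against your own workspace before deleting anything.

```shell
set -eu

# Mock workspace standing in for one interval of the real one (assumption).
WS=$(mktemp -d)
mkdir -p "$WS/frag_complete" "$WS/frag_interrupted"
echo '{"callsets": "pre-interruption state"}' > "$WS/callset.json.inc.backup"
echo '{"callsets": "partial state"}'          > "$WS/callset.json"
echo 'frag_complete'                          > "$WS/callset.json.fragmentlist"

# Step 1: restore callset.json from the backup left by the aborted import.
cp "$WS/callset.json.inc.backup" "$WS/callset.json"

# Step 2: delete fragment sub-directories not named in the fragment list.
for dir in "$WS"/*/; do
    frag=$(basename "$dir")
    if ! grep -qxF "$frag" "$WS/callset.json.fragmentlist"; then
        echo "removing incomplete fragment: $frag"
        rm -rf "$dir"
    fi
done
```

After this, only frag_complete remains and callset.json matches the backup.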
But when other jobs may be writing to this workspace, will this recovery action interfere with their writes? How can I tell whether other jobs are accessing this GenomicsDB repo?
Thanks!
-
Hi Yangyxt,
Running multiple GenomicsDBImport commands against the same workspace at once is not supported and will lead to issues. We also do not recommend running GenotypeGVCFs while GenomicsDBImport is running. So you should not have multiple jobs accessing the same GenomicsDB workspace at once.
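To check whether anything currently holds files open under the workspace before you recover it, one option on Linux is `lsof +D /path/to/workspace` (if lsof is installed), or an equivalent scan of /proc. A sketch of the /proc approach; the mktemp directory and the deliberately opened file descriptor are only there to make the example self-contained, and you would set WS to your real workspace path:

```shell
set -eu

# Demo setup (assumption): a scratch dir plus one open file descriptor
# simulating another job that is still writing into the workspace.
WS=$(mktemp -d)
exec 3> "$WS/demo.open"

# Scan every readable /proc/<pid>/fd entry for links into the workspace.
found=""
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
        "$WS"/*)
            pid=${fd#/proc/}; pid=${pid%%/*}
            echo "PID $pid has open: $target"
            found="yes"
            ;;
    esac
done
[ -n "$found" ] && echo "workspace is busy" || echo "workspace looks idle"
```

Note this only sees processes on the same host whose /proc entries you can read; if batches run on different cluster nodes against shared storage, you would need a job-level lock or scheduler-side check instead.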
If you are looking to parallelize the import by interval, you can use the --batch-size and --max-num-intervals-to-import-in-parallel parameters. More information is in the GenomicsDBImport documentation.
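For illustration, a single job parallelized this way might look like the sketch below. The workspace path, sample map, and interval list are hypothetical, and the batch/parallelism values are placeholders to tune for your data:

```shell
# Sketch only: hypothetical paths; assumes GATK 4.1.8.x on PATH.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path /data/central_genomicsdb \
    --sample-name-map batch1.sample_map \
    -L intervals.list \
    --batch-size 50 \
    --max-num-intervals-to-import-in-parallel 4
```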