GenomicsDBException: Duplicate sample name found:
Hello:
We have run into the GenomicsDBImport error "org.genomicsdb.exception.GenomicsDBException: Duplicate sample name found" when attempting to update our DBs. The command is shown below, followed by its output.
We have checked carefully that the indicated duplicate, SSC00007.haplotypeCalls.CPR.er.raw.vcf.gz, is not already in the target DB, DB_chr1.
Note that the gatk version is 4.1.6.0. We hope to have the latest version, 4.1.8.0, available soon.
Is it possible that this error is fixed in the latest version?
Any help will be appreciated.
Cheers,
Chuck
========== command ==============
module load gatk/4.1.6.0
gatk --java-options "-Xmx16g -Xms16g" GenomicsDBImport \
--batch-size 24 \
--reader-threads 12 \
--genomicsdb-update-workspace-path /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/CPRs_100_proto/DB_chr1 \
--intervals chr1:118739963-147510543 \
--verbosity DEBUG \
-V /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/phase2_CPRs/SSC00007_CPR/SSC00007.haplotypeCalls.CPR.er.raw.vcf.gz
============================
======= output =============
-
Hi Nicholas Bailey,
Unfortunately there is no way to remove the samples from the GenomicsDB workspace. This is why we recommend that users create a backup of the GenomicsDB workspace before updating.
Here is a ticket where the developers are discussing this: https://github.com/broadinstitute/gatk/issues/6558
Best,
Genevieve
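Since samples cannot be removed once added, the backup recommended above is the only way back. A minimal sketch of such a backup in Python, assuming the workspace is an ordinary directory on a local or network file system; the function name and the timestamped naming scheme are illustrative and not part of GATK:

```python
import shutil
import time
from pathlib import Path


def backup_workspace(workspace, backup_root):
    """Copy a GenomicsDB workspace directory to a timestamped backup directory."""
    src = Path(workspace)
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}.backup-{stamp}"
    # copytree raises if dest already exists, so an older backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

It would be called once before each update, for example backup_workspace("DB_chr1", "backups"), and only while no import is writing to the workspace.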
-
We did upgrade to gatk V4.1.8.1.
But the same error appears, "org.genomicsdb.exception.GenomicsDBException: Duplicate sample"
Thanks for any help.
Cheers,
Chuck
-
Charles H. Langley are you running this in parallel? Could you explain which commands are running at the same time?
-
Hello Genevieve:
"Running this in parallel"?
I am not sure which level of parallelization you are referring to.
The previous command included multithreading, as in "--reader-threads 12 \".
It also included "--batch-size 24 \", but it did not involve MPI or Spark.
Indeed no other gatk job was running on the system.
----------
I have further stripped down the command (defaults for --reader-threads and batch-size; see below).
Still the same error occurs.
We look forward to hearing further from you and your colleagues with ideas about what may be wrong here.
Cheers,
Chuck
________________________________________________________________________________
| => gatk --java-options "-Xmx16g -Xms16g" GenomicsDBImport \
| => --genomicsdb-update-workspace-path /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/CPRs_100_proto/DB_chr1 \
| => --intervals chr1:118739963-147510543 \
| => --verbosity DEBUG \
| => -V /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/phase2_CPRs/SSC00007_CPR/SSC00007.haplotypeCalls.CPR.er.raw.vcf.gz
Using GATK jar /afs/genomecenter.ucdavis.edu/software/gatk/4.1.8.1/static/gatk-package-4.1.8.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx16g -Xms16g -jar /afs/genomecenter.ucdavis.edu/software/gatk/4.1.8.1/static/gatk-package-4.1.8.1-local.jar GenomicsDBImport --genomicsdb-update-workspace-path /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/CPRs_100_proto/DB_chr1 --intervals chr1:118739963-147510543 --verbosity DEBUG -V /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/phase2_CPRs/SSC00007_CPR/SSC00007.haplotypeCalls.CPR.er.raw.vcf.gz
11:12:45.223 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/afs/genomecenter.ucdavis.edu/software/gatk/4.1.8.1/static/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:12:45.275 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression8725280501251879565.so
Sep 03, 2020 11:12:45 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
11:12:45.569 INFO GenomicsDBImport - ------------------------------------------------------------
11:12:45.570 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.8.1
11:12:45.570 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
11:12:45.570 INFO GenomicsDBImport - Executing as chuck@rooted3 on Linux v4.15.0-66-generic amd64
11:12:45.570 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-8u265-b01-0ubuntu2~16.04-b01
11:12:45.571 INFO GenomicsDBImport - Start Date/Time: September 3, 2020 11:12:45 AM PDT
11:12:45.571 INFO GenomicsDBImport - ------------------------------------------------------------
11:12:45.571 INFO GenomicsDBImport - ------------------------------------------------------------
11:12:45.572 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
11:12:45.572 INFO GenomicsDBImport - Picard Version: 2.22.8
11:12:45.574 INFO GenomicsDBImport - HTSJDK Defaults.BUFFER_SIZE : 131072
11:12:45.574 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.CREATE_INDEX : false
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.CREATE_MD5 : false
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.CUSTOM_READER_FACTORY :
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.REFERENCE_FASTA : null
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
11:12:45.575 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:12:45.576 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:12:45.576 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:12:45.576 INFO GenomicsDBImport - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
11:12:45.576 DEBUG ConfigFactory - Configuration file values:
11:12:45.580 DEBUG ConfigFactory - gcsMaxRetries = 20
11:12:45.580 DEBUG ConfigFactory - gcsProjectForRequesterPays =
11:12:45.581 DEBUG ConfigFactory - gatk_stacktrace_on_user_exception = false
11:12:45.581 DEBUG ConfigFactory - samjdk.use_async_io_read_samtools = false
11:12:45.581 DEBUG ConfigFactory - samjdk.use_async_io_write_samtools = true
11:12:45.581 DEBUG ConfigFactory - samjdk.use_async_io_write_tribble = false
11:12:45.581 DEBUG ConfigFactory - samjdk.compression_level = 2
11:12:45.581 DEBUG ConfigFactory - spark.kryoserializer.buffer.max = 512m
11:12:45.581 DEBUG ConfigFactory - spark.driver.maxResultSize = 0
11:12:45.581 DEBUG ConfigFactory - spark.driver.userClassPathFirst = true
11:12:45.581 DEBUG ConfigFactory - spark.io.compression.codec = lzf
11:12:45.582 DEBUG ConfigFactory - spark.executor.memoryOverhead = 600
11:12:45.582 DEBUG ConfigFactory - spark.driver.extraJavaOptions =
11:12:45.582 DEBUG ConfigFactory - spark.executor.extraJavaOptions =
11:12:45.582 DEBUG ConfigFactory - codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
11:12:45.582 DEBUG ConfigFactory - read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
11:12:45.582 DEBUG ConfigFactory - annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
11:12:45.582 DEBUG ConfigFactory - cloudPrefetchBuffer = 40
11:12:45.582 DEBUG ConfigFactory - cloudIndexPrefetchBuffer = -1
11:12:45.582 DEBUG ConfigFactory - createOutputBamIndex = true
11:12:45.583 INFO GenomicsDBImport - Deflater: IntelDeflater
11:12:45.583 INFO GenomicsDBImport - Inflater: IntelInflater
11:12:45.583 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:12:45.583 INFO GenomicsDBImport - Requester pays: disabled
11:12:45.583 INFO GenomicsDBImport - Initializing engine
11:12:45.794 WARN GenomicsDBImport - genomicsdb-update-workspace-path was set, so ignoring specified intervals.The tool will use the intervals specified by the initial import
11:12:46.188 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
11:12:46.651 DEBUG GenomeLocParser - Prepared reference sequence contig dictionary
11:12:46.651 DEBUG GenomeLocParser - chr1 (248956422 bp)
11:12:46.652 DEBUG GenomeLocParser - chr2 (242193529 bp)
11:12:46.652 DEBUG GenomeLocParser - chr3 (198295559 bp)
11:12:46.652 DEBUG GenomeLocParser - chr4 (190214555 bp)
11:12:46.652 DEBUG GenomeLocParser - chr5 (181538259 bp)
11:12:46.652 DEBUG GenomeLocParser - chr6 (170805979 bp)
11:12:46.652 DEBUG GenomeLocParser - chr7 (159345973 bp)
11:12:46.652 DEBUG GenomeLocParser - chr8 (145138636 bp)
11:12:46.653 DEBUG GenomeLocParser - chr9 (138394717 bp)
11:12:46.653 DEBUG GenomeLocParser - chr10 (133797422 bp)
11:12:46.653 DEBUG GenomeLocParser - chr11 (135086622 bp)
11:12:46.653 DEBUG GenomeLocParser - chr12 (133275309 bp)
11:12:46.653 DEBUG GenomeLocParser - chr13 (114364328 bp)
11:12:46.653 DEBUG GenomeLocParser - chr14 (107043718 bp)
11:12:46.653 DEBUG GenomeLocParser - chr15 (101991189 bp)
11:12:46.653 DEBUG GenomeLocParser - chr16 (90338345 bp)
11:12:46.654 DEBUG GenomeLocParser - chr17 (83257441 bp)
11:12:46.654 DEBUG GenomeLocParser - chr18 (80373285 bp)
11:12:46.654 DEBUG GenomeLocParser - chr19 (58617616 bp)
11:12:46.654 DEBUG GenomeLocParser - chr20 (64444167 bp)
11:12:46.654 DEBUG GenomeLocParser - chr21 (46709983 bp)
11:12:46.655 DEBUG GenomeLocParser - chr22 (50818468 bp)
11:12:46.655 DEBUG GenomeLocParser - chrX (156040895 bp)
11:12:46.655 DEBUG GenomeLocParser - chrY (57227415 bp)
11:12:46.655 DEBUG GenomeLocParser - chrM (16569 bp)
11:12:46.655 DEBUG GenomeLocParser - chr1_KI270706v1_random (175055 bp)
11:12:46.655 DEBUG GenomeLocParser - chr1_KI270707v1_random (32032 bp)
11:12:46.655 DEBUG GenomeLocParser - chr1_KI270708v1_random (127682 bp)
11:12:46.655 DEBUG GenomeLocParser - chr1_KI270709v1_random (66860 bp)
11:12:46.656 DEBUG GenomeLocParser - chr1_KI270710v1_random (40176 bp)
11:12:46.656 DEBUG GenomeLocParser - chr1_KI270711v1_random (42210 bp)
11:12:46.656 DEBUG GenomeLocParser - chr1_KI270712v1_random (176043 bp)
11:12:46.656 DEBUG GenomeLocParser - chr1_KI270713v1_random (40745 bp)
11:12:46.656 DEBUG GenomeLocParser - chr1_KI270714v1_random (41717 bp)
11:12:46.656 DEBUG GenomeLocParser - chr2_KI270715v1_random (161471 bp)
11:12:46.656 DEBUG GenomeLocParser - chr2_KI270716v1_random (153799 bp)
11:12:46.656 DEBUG GenomeLocParser - chr3_GL000221v1_random (155397 bp)
11:12:46.656 DEBUG GenomeLocParser - chr4_GL000008v2_random (209709 bp)
11:12:46.657 DEBUG GenomeLocParser - chr5_GL000208v1_random (92689 bp)
11:12:46.657 DEBUG GenomeLocParser - chr9_KI270717v1_random (40062 bp)
11:12:46.657 DEBUG GenomeLocParser - chr9_KI270718v1_random (38054 bp)
11:12:46.657 DEBUG GenomeLocParser - chr9_KI270719v1_random (176845 bp)
11:12:46.657 DEBUG GenomeLocParser - chr9_KI270720v1_random (39050 bp)
11:12:46.657 DEBUG GenomeLocParser - chr11_KI270721v1_random (100316 bp)
11:12:46.657 DEBUG GenomeLocParser - chr14_GL000009v2_random (201709 bp)
11:12:46.657 DEBUG GenomeLocParser - chr14_GL000225v1_random (211173 bp)
11:12:46.658 DEBUG GenomeLocParser - chr14_KI270722v1_random (194050 bp)
11:12:46.658 DEBUG GenomeLocParser - chr14_GL000194v1_random (191469 bp)
11:12:46.658 DEBUG GenomeLocParser - chr14_KI270723v1_random (38115 bp)
11:12:46.658 DEBUG GenomeLocParser - chr14_KI270724v1_random (39555 bp)
11:12:46.658 DEBUG GenomeLocParser - chr14_KI270725v1_random (172810 bp)
11:12:46.658 DEBUG GenomeLocParser - chr14_KI270726v1_random (43739 bp)
11:12:46.658 DEBUG GenomeLocParser - chr15_KI270727v1_random (448248 bp)
11:12:46.658 DEBUG GenomeLocParser - chr16_KI270728v1_random (1872759 bp)
11:12:46.658 DEBUG GenomeLocParser - chr17_GL000205v2_random (185591 bp)
11:12:46.659 DEBUG GenomeLocParser - chr17_KI270729v1_random (280839 bp)
11:12:46.659 DEBUG GenomeLocParser - chr17_KI270730v1_random (112551 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270731v1_random (150754 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270732v1_random (41543 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270733v1_random (179772 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270734v1_random (165050 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270735v1_random (42811 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270736v1_random (181920 bp)
11:12:46.659 DEBUG GenomeLocParser - chr22_KI270737v1_random (103838 bp)
11:12:46.660 DEBUG GenomeLocParser - chr22_KI270738v1_random (99375 bp)
11:12:46.660 DEBUG GenomeLocParser - chr22_KI270739v1_random (73985 bp)
11:12:46.660 DEBUG GenomeLocParser - chrY_KI270740v1_random (37240 bp)
11:12:46.660 DEBUG GenomeLocParser - chrUn_KI270302v1 (2274 bp)
11:12:46.661 DEBUG GenomeLocParser - chrUn_KI270304v1 (2165 bp)
11:12:46.661 DEBUG GenomeLocParser - chrUn_KI270303v1 (1942 bp)
...lots of unassembled scaffolds and decoys...
11:12:46.793 DEBUG GenomeLocParser - HLA-DRB1*15:03:01:02 (11569 bp)
11:12:46.793 DEBUG GenomeLocParser - HLA-DRB1*16:02:01 (11005 bp)
11:12:46.812 INFO IntervalArgumentCollection - Processing 28770581 bp from intervals
11:12:46.814 INFO GenomicsDBImport - Done initializing engine
11:12:46.814 INFO GenomicsDBImport - Callset Map JSON file will be re-written to /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/CPRs_100_proto/DB_chr1/callset.json
11:12:46.814 INFO GenomicsDBImport - Incrementally importing to workspace - /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/CPRs_100_proto/DB_chr1
11:12:46.814 INFO ProgressMeter - Starting traversal
11:12:46.815 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
11:12:47.254 INFO GenomicsDBImport - Shutting down engine
[September 3, 2020 11:12:47 AM PDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=16464216064
org.genomicsdb.exception.GenomicsDBException: Duplicate sample name found: SSC00007. Sample was originally in /rooted3/langley/work/home/chuck/rad/SFARI/SSC_hg38/WGS/phase2_CPRs/SSC00007_CPR/SSC00007.haplotypeCalls.CPR.er.raw.vcf.gz
    at org.genomicsdb.importer.extensions.CallSetMapExtensions.checkDuplicateCallsetsForIncrementalImport(CallSetMapExtensions.java:270)
    at org.genomicsdb.importer.extensions.CallSetMapExtensions.mergeCallsetsForIncrementalImport(CallSetMapExtensions.java:241)
    at org.genomicsdb.importer.GenomicsDBImporter.<init>(GenomicsDBImporter.java:252)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:745)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
________________________________________________________________________________
-
Charles H. Langley could you submit a bug report?
Please upload or provide notes for where to obtain:
- workspace you are updating
- reference
- sample
- interval to test
-
Hello Charles H. Langley,
Is it possible that this "workspace update" process failed with a different error before? If so, it's possible that some metadata within the workspace is in an inconsistent state. Can you list the contents of the workspace - I'm specifically interested in any files that end in *.json or *.inc.backup?
-
contents of DB_chr1/:
-rwxrwx--- 1 sasha radusr 0 2020-04-20-14:02 __tiledb_workspace.tdb
-rwxrwx--- 1 sasha radusr 308K 2020-04-20-14:02 vcfheader.vcf
-rwxrwx--- 1 sasha radusr 286K 2020-04-20-14:02 vidmap.json
drwxrwx--- 282 sasha radusr 284 2020-05-12-10:04 chr1$118739963$147510543/
-rwxrwx--- 1 chuck chuck 825K 2020-08-28-13:38 callset.json
-rwx------ 1 chuck chuck 19K 2020-09-03-11:12 callset.json.fragmentlist
-rwx------ 1 chuck chuck 825K 2020-09-03-11:12 callset.json.inc.backup
Shall I upload certain of these?
Cheers,
Chuck
-
Not yet -- can you do a search/grep for the duplicate sample name within callset.json and callset.json.inc.backup? So something like:
grep SSC00007 callset.json
and
grep SSC00007 callset.json.inc.backup
Assuming the sample name it complained about is SSC00007
Offhand, they look identical, which makes me wonder if the update was tried (and failed) multiple times. Also, do you have a backup of this workspace?
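The error message names a sample (SSC00007), not a file, so it can also help to confirm which sample names the input VCF itself declares before searching the workspace metadata. A minimal sketch, assuming a standard VCF whose `#CHROM` header line lists the sample names after the nine fixed columns; the function name is illustrative:

```python
import gzip


def vcf_sample_names(vcf_path):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    # bgzip-compressed files are valid multi-member gzip, so gzip.open can read them.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 0-8 are the fixed fields (CHROM through FORMAT);
                # every column after them is a sample name.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {vcf_path}")
```

The names it returns are the strings to search for in callset.json, which guards against searching for a file name or a mistyped sample ID.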
-
"SSC00007" is not in either callset.json or callset.json.inc.backup.
Thanks,
Chuck
-
Ah - interesting. Yes, if you don't mind uploading the callset.json and callset.json.fragmentlist as part of the bug report that would be useful.
If callset.json.inc.backup is different from callset.json, please include that as well.
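Whether the two files differ can be checked byte for byte. A minimal sketch using Python's standard library; the helper name is illustrative:

```python
import filecmp


def same_contents(path_a, path_b):
    """Return True if the two files have identical contents."""
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting matching size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

For example, same_contents("callset.json", "callset.json.inc.backup") would be run from inside the workspace directory.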
-
Bug Report Re:
GenomicsDBException: Duplicate sample name found:
CH Langley 2Sept2020
Ah - interesting. Yes, if you don't mind uploading the callset.json and callset.json.fragmentlist as part of the bug report that would be useful.
The command_log file and the two requested files are in
Genomic_DBImport_bug_report.zip at ftp.broadinstitute.org
If callset.json.inc.backup is different from callset.json, please include that as well.
diff reported NO differences between the two files.
Thanks for the help.
Cheers,
Chuck
-
One added item: the directory listing from the DB (note the permissions, in case they matter).
| => ll
total 2.7M
-rwxrwx--- 1 sasha radusr 0 2020-04-20-14:02 __tiledb_workspace.tdb
-rwxrwx--- 1 sasha radusr 308K 2020-04-20-14:02 vcfheader.vcf
-rwxrwx--- 1 sasha radusr 286K 2020-04-20-14:02 vidmap.json
drwxrwx--- 282 sasha radusr 284 2020-05-12-10:04 chr1$118739963$147510543/
-rwxrwx--- 1 chuck chuck 825K 2020-08-28-13:38 callset.json
-rwx------ 1 chuck chuck 19K 2020-09-03-11:12 callset.json.fragmentlist
-rwx------ 1 chuck chuck 825K 2020-09-03-11:12 callset.json.inc.backup
__________________________________________________________________________
-
Charles H. Langley thank you for uploading the file, we will work on it and let you know if we have any questions or updates.
-
Genevieve-Brandt-she-her thanks for passing the file along!
Charles H. Langley -- the callset.json file you uploaded did have SSC00007 in there...any chance you looked at (or uploaded) the wrong file? Or is it possible that you searched using the letter O instead of the number zero (0)?
In any case, the metadata for the datastore indicates it already has data for that sample. If you want to figure out which samples the metadata thinks are part of the datastore currently, you could try a command like this:
python -m json.tool callset.json |grep sample_name|cut -d\" -f4
Keep in mind, it is possible that the metadata might be in an inconsistent state due to workspace update failure...but it is unlikely SSC00007 is in there due to a similar failure, it is one of the earlier samples in the callset.json file you sent.
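The same listing can be done in Python with an exact-match check, which sidesteps grep substring matches and letter-O versus zero typos. This is a sketch only: the one thing taken from the command above is that names are stored under a "sample_name" key, and the rest of the callset.json layout is not assumed, so the code walks the whole document. Function names are illustrative.

```python
import json
from collections import Counter


def sample_names(node):
    """Yield every string stored under a "sample_name" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "sample_name" and isinstance(value, str):
                yield value
            else:
                yield from sample_names(value)
    elif isinstance(node, list):
        for item in node:
            yield from sample_names(item)


def callset_report(callset_path, query):
    """Summarise the sample names the workspace metadata claims to hold."""
    with open(callset_path) as handle:
        names = list(sample_names(json.load(handle)))
    counts = Counter(names)
    return {
        "total": len(names),
        "query_present": query in counts,  # exact match, not substring
        "repeated": sorted(name for name, n in counts.items() if n > 1),
    }
```

Calling callset_report("callset.json", "SSC00007") reports whether the metadata already lists that sample, and whether any name appears more than once.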
-
Hello GATK Team,
I am having this same issue using GATK 4.1.7.0 and found that the samples I intend to add to the database are in fact present in the callset.json file and respective backup file, but I know they have not actually been fully added to the database because most of the chromosome directories have not been updated (when viewing with something like ls -l). If I'm understanding the output correctly, it seems GenomicsDBImport ended before finishing and attempted to restart, so that would explain why these samples are only partially added.
Reading this thread has helped me come to the above conclusion (thank you Genevieve and Melvin) but I am still unsure how to fix the issue. Is there any way to remove a sample from the database and try again?
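The `ls -l` check described above can be scripted when a workspace has many interval directories. A minimal sketch that lists each subdirectory with its last-modified time, oldest first; like `ls -l`, it is only a rough signal, since a directory's timestamp changes only when entries are added or removed directly inside it. The function name is illustrative.

```python
import os
import time


def subdir_mtimes(workspace):
    """Return (name, last-modified string) for each subdirectory, oldest first."""
    rows = []
    with os.scandir(workspace) as entries:
        for entry in entries:
            if entry.is_dir():
                rows.append((entry.stat().st_mtime, entry.name))
    rows.sort()
    return [
        (name, time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)))
        for mtime, name in rows
    ]
```

Directories whose timestamps predate the failed update are the ones that were probably never touched by it.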
-
Hi Genevieve,
Thank you for getting back to me and posting that link. I figured that may not be an option. No harm, I haven't added too many samples yet and I'll be sure to back up the DB I'm generating now.
-
Oh good! Glad you found it out on the earlier side. Thanks for writing into the forum!