GenomicsDB incremental error
If you are seeing an error, please provide (REQUIRED):
a) GATK version used: 4.1.4.1
b) Exact command used:
gatk GenomicsDBImport -V S1_Haplotypecaller.g.vcf -V S2_Haplotypecaller.g.vcf -V S3_Haplotypecaller.g.vcf --genomicsdb-update-workspace-path GenomicsDB_subset/chr1_dstore --intervals chr1
c) Entire error log:
21:02:55.282 INFO ProgressMeter - Starting traversal
21:02:55.283 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
21:03:36.563 INFO GenomicsDBImport - Importing batch 1 with 24 samples
terminate called after throwing an instance of 'FileBasedVidMapperException'
what(): FileBasedVidMapperException : Conflicting file/stream names specified for sample/callset S1 S2_stream, S1_stream
Using GATK jar /gpfs/data/user/krithika/gatk-4.1.4.1/gatk-package-4.1.4.1-local.jar
Could you please check the above message and let me know whether it causes any problem for the results?
-
Please see this link for information about this error message. Users were seeing it when there were duplicated fields in the GVCFs. The team was able to solve this issue, so it should be fixed if you use a newer version of GATK. Try the newest version, 4.1.9.0, which has the most up-to-date GenomicsDBImport.
Genevieve
-
Dear Brandt,
I just ran into the same issue using GATK 4.1.9.0 (however, the GenomicsDB workspace I am updating was generated with GATK 4.1.8.1). The linked GitHub issue report seems to be about conflicting field names in GVCF files, e.g. DP in both FORMAT and INFO.
I'm not quite sure why the error message says the file names are conflicting.
Before incrementally updating the GenomicsDB workspace, I always:
1. check the callset.json file to make sure no existing sample IDs conflict with the sample IDs in the GVCF files I'm about to import.
2. check the <xxx>.fragmentlist file and the subfolders in the GenomicsDB workspace to make sure the database is not corrupted.
However, it seems that neither check prevents this issue from happening.
Could you please elaborate on the reasons for this issue? Or on exactly which part of the GVCF files might contain duplicated info, so I can fix the GVCF files before importing them? Thanks!
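
The first check above (sample IDs in callset.json versus the incoming GVCFs) can be sketched in Python as below. This is only a rough sketch, not a GATK tool: the function names are made up for illustration, and it assumes that callset.json has a top-level "callsets" entry holding either a list of records with a "sample_name" key or a dict keyed by sample name. It also reports sample names shared by more than one input GVCF, since one possible reading of the error in the original post (S1 listed with both S2_stream and S1_stream) is that two input files carry the same sample name in their header.

```python
import gzip
import json
from collections import Counter


def gvcf_sample_names(path):
    """Return the sample names from the #CHROM header line of a (g)VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed VCF columns; samples start at column 10.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without finding a header line
    return []


def workspace_sample_names(callset_json_path):
    """Return the sample names already registered in a workspace's callset.json."""
    with open(callset_json_path) as handle:
        callsets = json.load(handle)["callsets"]
    # Assumption: "callsets" is either a dict keyed by sample name
    # or a list of records that each carry a "sample_name" key.
    if isinstance(callsets, dict):
        return set(callsets)
    return {entry["sample_name"] for entry in callsets}


def find_conflicts(gvcf_paths, callset_json_path=None):
    """Return (names shared by several inputs, names already in the workspace)."""
    counts = Counter(
        name for path in gvcf_paths for name in gvcf_sample_names(path)
    )
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    existing = workspace_sample_names(callset_json_path) if callset_json_path else set()
    already_imported = sorted(set(counts) & existing)
    return duplicated, already_imported
```

For the command in the original post, this would be called with the three -V files and the callset.json inside GenomicsDB_subset/chr1_dstore. A non-empty first list would mean two of the input files declare the same sample name.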
-
Dear Brandt,
It just occurred to me that the meta-lines in GVCF files might differ between GATK versions.
Since it is a vidmapper exception, I checked the vidmap.json in the GenomicsDB workspace. If the imported GVCF files have slightly different content in their meta-lines, would that cause a FileBasedVidMapperException?
BTW, I run HaplotypeCaller (HC) on each contig and gather the per-contig files into one using GatherVcfs in GATK. Since the ##GVCFBlock meta-lines might differ, I used an in-house script to generate a common set of VCF meta-lines, swapped the original VCF meta-lines for the common ones, and then ran GatherVcfs.
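
To see whether the meta-lines really differ between two GVCFs, a comparison along the following lines might help. It is a minimal sketch with hypothetical function names, it only looks at ##INFO, ##FORMAT and ##FILTER definitions, and it does not establish whether such differences are what triggers the FileBasedVidMapperException.

```python
import gzip


def header_meta_lines(path):
    """Return the leading '##' meta-lines of a (g)VCF, without newlines."""
    opener = gzip.open if path.endswith(".gz") else open
    meta = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # the #CHROM line or the first record ends the meta block
            meta.append(line.rstrip("\n"))
    return meta


def field_definitions(path, kinds=("INFO", "FORMAT", "FILTER")):
    """Map (kind, ID) to the full meta-line that defines that field."""
    defs = {}
    for line in header_meta_lines(path):
        for kind in kinds:
            prefix = f"##{kind}=<ID="
            if line.startswith(prefix):
                field_id = line[len(prefix):].split(",", 1)[0].rstrip(">")
                defs[(kind, field_id)] = line
    return defs


def differing_definitions(path_a, path_b):
    """Return field definitions that are missing from or differ between two files."""
    a, b = field_definitions(path_a), field_definitions(path_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

An empty result means both files define the same INFO, FORMAT and FILTER fields identically. Any entry shows the two definitions side by side, with None where a file lacks the field.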
-
- I suggest rerunning the workflow without changing the GVCF files as you described here: "I run HC on each contig and gather the per-contig files into one using GatherVcfs in GATK. Since the ##GVCFBlock meta-lines might differ, I used an in-house script to generate a common set of VCF meta-lines, swapped the original VCF meta-lines for the common ones, and then ran GatherVcfs."
- Please follow the steps provided in this best practices doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
- If that doesn't fix it, then please provide the entire error log, the versions of the tools used, and the exact commands used to generate the GVCFs and the GenomicsDB workspace.
-
I also recommend going through and using these prebuilt workflows: https://app.terra.bio/#workspaces/help-gatk/GATK4-Germline-Preprocessing-VariantCalling-JointCalling
For more info on how to use Terra, take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360041155152-GATK-on-the-cloud-with-Terra