GenomicsDB is a datastore format developed by our collaborators at Intel to store variant call data (where "datastore" = something that we mere mortals can think of as a database, even though IT professionals insist that it's a completely different thing). The long-term vision is that ultimately we will use this datastore format as an alternative to VCF files for storing and working with variant data. For now though, we are only actively using it as a GVCF consolidation tool in the germline joint-calling workflow.
Note that at the moment GenomicsDB only supports diploid data; our Intel collaborators are working on implementing support for non-diploid data, but in the meantime if you need to work with non-diploid data you'll need to use CombineGVCFs instead.
There are currently five supported operations you can do with a GenomicsDB datastore:
- Create a new GenomicsDB datastore from one or more GVCFs
- Joint-call samples in a GenomicsDB datastore
- Extract data from a GenomicsDB datastore
- Incrementally add new GVCFs to an existing GenomicsDB datastore
- Generate interval_list for an existing GenomicsDB datastore
1. Create a new GenomicsDB datastore from one or more GVCFs
The goal of this operation is to consolidate a set of GVCFs into a single datastore that GenotypeGVCFs can run on (because GenotypeGVCFs can only take a single input). To do this via GenomicsDB, we use the GenomicsDBImport tool. This tool takes in one or more single-sample GVCFs, imports data over at least one genomic interval (this feature is available in v22.214.171.124 and later and stable in v126.96.36.199 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.
Here's what a typical command looks like:
gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21
This command generates a directory called my_database containing the combined GVCF data.
Note that the GVCFs can also be passed in as a list or map instead of being enumerated in the command.
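One way to pass the GVCFs as a map is the --sample-name-map argument of GenomicsDBImport, which takes a tab-delimited file with one sample name and one GVCF path per line. The sketch below builds such a file from a directory of GVCFs. The helper name make_sample_map and the file name cohort.sample_map are made up for illustration, and the sketch assumes that each file name, minus the .g.vcf extension, is the sample name you want to use.

```shell
# Write "<sample><TAB><path>" for every *.g.vcf file in a directory.
# Assumption: the file name minus ".g.vcf" is the sample name you want.
make_sample_map() {
  gvcf_dir=$1
  out=$2
  : > "$out"                      # start with an empty map file
  for f in "$gvcf_dir"/*.g.vcf; do
    [ -e "$f" ] || continue       # directory had no matching files
    printf '%s\t%s\n' "$(basename "$f" .g.vcf)" "$f" >> "$out"
  done
}

# Only runs when gatk is installed and the example data directory exists.
if command -v gatk >/dev/null 2>&1 && [ -d data/gvcfs ]; then
  make_sample_map data/gvcfs cohort.sample_map
  gatk GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --genomicsdb-workspace-path my_database \
    --intervals chr20
fi
```

A map file like this is mostly a convenience for large cohorts, where enumerating every GVCF with -V becomes unwieldy.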
2. Joint-call samples in a GenomicsDB datastore
Once you have a GenomicsDB datastore containing GVCF data from one or more samples, you can run GenotypeGVCFs on it to joint-call the samples it contains.
Here's an example command:
gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf
This will produce a multi-sample VCF with all the usual bells and whistles.
Note the gendb:// prefix on the database input directory path. That's the only difference compared to a regular GenotypeGVCFs command, but it's an important one: if you forget the prefix you will get a big fat error.
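If you script these commands, a small guard can make the prefix hard to forget. This is a minimal sketch for local workspace paths only; the function name gendb_uri is hypothetical and not part of GATK.

```shell
# Return a local workspace path with exactly one gendb:// prefix.
gendb_uri() {
  case $1 in
    gendb://*) printf '%s\n' "$1" ;;           # already prefixed, keep as-is
    *)         printf 'gendb://%s\n' "$1" ;;   # bare directory path
  esac
}

gendb_uri my_database   # prints gendb://my_database
```

The result can then be passed to -V, for example -V "$(gendb_uri my_database)".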
3. Extract data from a GenomicsDB datastore
If you want to generate a flat multisample GVCF file from a GenomicsDB you created, you can do so with SelectVariants as follows:
gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf
You can use any of the usual SelectVariants modifiers to extract, for example, only a subset of samples or a subset of genomic intervals. This can be useful when you are troubleshooting variant calls and want to see what the intermediate GVCF looked like, since the calls in the GenomicsDB itself cannot be viewed in a human-readable way.
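As an illustration of those modifiers, the sketch below pulls a single sample over a single interval using the SelectVariants arguments -sn (sample name) and -L (interval). The wrapper function extract_subset and the RUN switch are hypothetical conveniences; by default the function only prints the command so you can inspect it first.

```shell
# Print (default) or run (RUN=1) a SelectVariants command that extracts
# one sample over one interval from a GenomicsDB workspace.
extract_subset() {
  workspace=$1; sample=$2; interval=$3; out=$4
  # Rebuild the positional parameters as the command to run.
  set -- gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V "gendb://$workspace" \
    -sn "$sample" \
    -L "$interval" \
    -O "$out"
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

extract_subset my_database mother chr20 mother.chr20.g.vcf
```

Setting RUN=1 in the environment executes the command instead of printing it.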
4. Incrementally add new GVCFs to an existing GenomicsDB datastore
If you want to add new GVCFs to an existing GenomicsDB datastore you can now do so using:
gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-update-workspace-path existing_database
Note that we do not support updating existing samples: the new sample names must not match any sample already in the datastore. You also cannot specify intervals when incrementally adding new samples; the tool uses the intervals specified when the datastore was initially created. We recommend backing up existing datastores before attempting an incremental addition, because a failure partway through may leave the datastore in a corrupt or invalid state.
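Both precautions can be scripted. The sketch below assumes you kept a tab-delimited sample map (sample name in column 1) from the original import and have another one for the new samples; those files, the function names, and the .backup suffix are all illustrative choices rather than anything GATK requires.

```shell
# Print sample names that appear in both tab-delimited sample maps
# (column 1 = sample name). Prints nothing when there is no overlap.
duplicate_samples() {
  { cut -f1 "$1" | sort -u; cut -f1 "$2" | sort -u; } | sort | uniq -d
}

# Back up the workspace, then add the new samples.
# Refuses to start if any new sample name is already in the datastore.
safe_incremental_import() {
  workspace=$1; old_map=$2; new_map=$3
  dups=$(duplicate_samples "$old_map" "$new_map")
  if [ -n "$dups" ]; then
    echo "Already in datastore: $dups" >&2
    return 1
  fi
  cp -r "$workspace" "$workspace.backup" || return 1
  gatk GenomicsDBImport \
    --sample-name-map "$new_map" \
    --genomicsdb-update-workspace-path "$workspace"
}
```

A call would look like safe_incremental_import existing_database old.sample_map new.sample_map. If the import fails, the copy in existing_database.backup is the untouched original.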
If you do not have a backup of the workspace, another potential failsafe is available as long as the --consolidate option was not used while incrementally adding new samples. In that case, a failed run leaves behind a copy of the original callset file (suffixed .inc.backup) and a list of the original fragment directories (suffixed .fragmentlist), which names the directories within the GenomicsDB workspace that existed before the incremental import. To recover, replace the callset file in the workspace with the original callset file (the .inc.backup file) and delete all directories not named in the .fragmentlist file. Do not delete directories named genomicsdb_meta_dir.
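The directory cleanup is easy to get wrong by hand, so a dry run can help. The sketch below only prints candidate directories and deletes nothing. It rests on assumptions the text above does not confirm: that the .fragmentlist file holds one directory per line (bare names or full paths), and that you point the function at the directory that directly contains the fragment directories. The function name stale_fragments is made up. Restoring the callset file is left as a manual step.

```shell
# List directories directly under $1 that are NOT named in fragment list $2.
# genomicsdb_meta_dir is always kept. Nothing is deleted; review the
# output before removing anything by hand.
stale_fragments() {
  dir=$1
  list=$2
  # Reduce each list entry to its last path component.
  names=$(sed -e 's|/*$||' -e 's|.*/||' "$list")
  for d in "$dir"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    [ "$name" = genomicsdb_meta_dir ] && continue
    # -x: whole-line match, -F: literal string, -q: no output
    printf '%s\n' "$names" | grep -qxF -e "$name" || printf '%s\n' "$name"
  done
}
```

Check the printed names against the actual workspace layout before deleting anything.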
5. Generate interval_list for an existing GenomicsDB datastore
If you want to generate a Picard-style interval_list file with the intervals specified for the creation of a GenomicsDB datastore, you can do so with:
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path existing_database \
    --output-interval-list-to-file /path/to/output/file
An interval_list file will be generated at /path/to/output/file with the intervals used to generate the GenomicsDB datastore and an appropriate sequence header.
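A Picard-style interval_list consists of SAM-style header lines starting with @, followed by tab-separated rows of contig, start, end, strand and name. To quickly check which intervals a datastore was built with, something like the following sketch can summarize the file; the function name list_intervals is hypothetical.

```shell
# Print each interval in a Picard-style interval_list as contig:start-end,
# skipping the "@" header lines.
list_intervals() {
  awk -F '\t' '!/^@/ && NF >= 3 { printf "%s:%s-%s\n", $1, $2, $3 }' "$1"
}
```

For example, list_intervals /path/to/output/file prints one line per interval.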