GenomicsDB is a datastore format developed by our collaborators at Intel to store variant call data (where "datastore" = something that we mere mortals can think of as a database, even though IT professionals insist that it's a completely different thing). The long-term vision is that ultimately we will use this datastore format as an alternative to VCF files for storing and working with variant data. For now though, we are only actively using it as a GVCF consolidation tool in the germline joint-calling workflow.
There are currently five supported operations you can do with a GenomicsDB datastore: create a new datastore from one or more GVCFs, joint-call the samples it contains, extract sample data from it, incrementally add new GVCFs, and generate an interval_list from an existing datastore.
Contents
- Create a new GenomicsDB datastore from one or more GVCFs
- Joint-call samples in a GenomicsDB datastore
- Extract data from a GenomicsDB datastore
- Incrementally add new GVCFs to an existing GenomicsDB datastore
- Generate interval_list for an existing GenomicsDB datastore
1. Create a new GenomicsDB datastore from one or more GVCFs
The goal of this operation is to consolidate a set of GVCFs into a single datastore that GenotypeGVCFs can run on (because GenotypeGVCFs can only take a single input). To do this via GenomicsDB, we use the GenomicsDBImport tool. This tool takes in one or more single-sample GVCFs, imports data over one or more genomic intervals (multi-interval support is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.
Here's what a typical command looks like:
gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21
This command generates a directory called my_database containing the combined GVCF data.
Note that the GVCFs can also be passed in as a list or map instead of being enumerated in the command.
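For larger cohorts it is more convenient to use a sample name map than to enumerate every -V argument. As a minimal sketch (assuming the --sample-name-map option accepts a tab-delimited file of sample names and GVCF paths; check the GenomicsDBImport documentation for your GATK version), a file named, say, cohort.sample_map could look like this:

mother	data/gvcfs/mother.g.vcf
father	data/gvcfs/father.g.vcf
son	data/gvcfs/son.g.vcf

and the import command becomes:

gatk GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21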
2. Joint-call samples in a GenomicsDB datastore
Once you have a GenomicsDB datastore containing GVCF data from one or more samples, you can run GenotypeGVCFs on it to joint-call the samples it contains.
Here's an example command:
gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf
This will produce a multi-sample VCF with all the usual bells and whistles.
Note the gendb:// prefix to the database input directory path. That's the only difference compared to a regular GenotypeGVCFs command, but it's an important one -- if you forget the prefix you will get a big fat error.
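If running a single joint-calling command over the whole datastore is too slow for a large cohort, one common workaround (sketched below, not an official recipe) is to scatter GenotypeGVCFs over intervals with -L, restricting each run to intervals that the datastore actually covers, and then gather the per-interval VCFs afterwards:

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -L chr20 \
    -O test_output.chr20.vcf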
3. Extract data from a GenomicsDB datastore
If you want to generate a flat multisample GVCF file from a GenomicsDB you created, you can do so with SelectVariants as follows:
gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf
You can use any of the usual SelectVariants modifiers to extract e.g. only a subset of samples, a subset of genomic intervals, and so on. This can be useful for troubleshooting variant calls, for example when you want to inspect what the intermediate GVCF looks like, since the calls in the GenomicsDB itself cannot be viewed in a human-readable way.
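For example, here is a hedged sketch of extracting a single sample over a single interval (assuming the standard --sample-name and -L arguments behave here as they do with a regular VCF input):

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    --sample-name mother \
    -L chr20 \
    -O mother.chr20.g.vcf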
4. Incrementally add new GVCFs to an existing GenomicsDB datastore
If you want to add new GVCFs to an existing GenomicsDB datastore you can now do so using GenomicsDBImport:
gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-update-workspace-path existing_database
Note that we do not support updating existing samples; that is, the sample names must not be the same as any samples already in the datastore. You also cannot specify intervals when incrementally adding new samples; the tool will use the intervals specified when the datastore was initially created. We recommend backing up existing datastores before trying incremental addition, because if the tool fails while adding new samples, it may leave the datastore in a corrupt/invalid state.
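Since the datastore is just a directory on disk, a plain copy is enough for a backup. A minimal sketch, assuming the workspace is the existing_database directory used above:

cp -r existing_database existing_database.backup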
If you do not have a backup workspace available, another potential failsafe exists as long as the --consolidate option was not used during the incremental import. In that case, if the tool fails, it leaves behind a copy of the original callset file (suffixed .inc.backup) and a list of the original fragment directories (suffixed .fragmentlist, containing the directories within the GenomicsDB workspace that existed before the incremental import). To recover, replace the callset file in the workspace with the original callset file (the .inc.backup file) and delete all directories not named in the .fragmentlist file. Do not delete directories named genomicsdb_meta_dir.
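As a rough sketch of that recovery procedure (the exact file names here are assumptions; the main callset file is typically callset.json, but inspect your workspace before deleting anything):

cd existing_database
# Restore the pre-import callset file from the backup left by the failed import
cp callset.json.inc.backup callset.json    # assumed file names, check your workspace
# See which fragment directories existed before the incremental import
cat *.fragmentlist
# Manually remove any fragment directories NOT listed there,
# but never delete directories named genomicsdb_meta_dir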
5. Generate interval_list for an existing GenomicsDB datastore
If you want to generate a Picard-style interval_list file with the intervals that were specified when a GenomicsDB datastore was created, you can do so with GenomicsDBImport:
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path existing_database \
    --output-interval-list-to-file /path/to/output/file
An interval_list file will be generated at /path/to/output/file with the intervals used to generate the GenomicsDB datastore and an appropriate sequence header.
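One way to use this output, sketched here with a hypothetical file name db_intervals.interval_list, is to pass it back to other GATK tools via -L, for example to make sure a joint-calling run only covers intervals that actually exist in the datastore:

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -L db_intervals.interval_list \
    -O test_output.vcf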
5 comments
Hi, I was running a set of over 1000 VCF files and the second joint-calling step (as one call) was taking days, even weeks, and doing that on a cloud is not cost-effective. FAIL
It is mentioned that
"You can use any of the usual SelectVariants modifiers to extract e.g. only a subset of samples, a subset of genomic intervals, and so on."
but it would be nice if you expanded on larger use cases, e.g. much larger cohorts than one trio. Do we address highly mixed or structured, diverged populations? Finnish and Ashkenazi Jewish + AMR??? Can we skip joint genotyping?
Locally on the HPC we split this into batches of 10 and by chromosome, and it completed in a few hours vs. days/weeks. To make this useful for larger datasets, a lot more content is needed on how to effectively parallelize the joint calling of samples in step 2. Maybe provide more documentation on using GenomicsDB for thousands or tens of thousands of samples? As is, this leads the user to believe that they can run one call on the GenomicsDB, and I just don't think that is feasible. Thank you for helping us get to a data format other than VCF! I would also like to know more about GATK using Hail, and alternatives for using that format like Apache Spark... Why did you choose this implementation over another?
The incremental import argument efficiently speeds up updating the DB. But is it possible to add a similar argument to joint genotyping, since joint genotyping is the bottleneck for cohort scaling?
Hi all,
I made a GenomicsDB some time ago with specific intervals (using BED files for the regions I need from Agilent exome padded regions of interest), and now we have decided to explore another region that is not one of the intervals I initially fed into GenomicsDB. Based on what I have read, it looks like we can incrementally add samples to the GenomicsDB but cannot incrementally add genomic intervals.
So I think I will have to make another GenomicsDB with that new interval using the same sample list as before. My question is whether remaking the GenomicsDB with the new interval is the only solution, or is there a shortcut to add the new interval to the existing GenomicsDB?
Thank you!
It is not clear from this article what the file extension is for the GenomicsDB object, where it is stored, how to find it, or how to identify it.
Hi, I am going to make a GenomicsDB with 25K samples and do joint calling. Is it possible to do that with that many samples? If so, what is the most efficient way?
Thank you.