"Using a GenomicsDB with SelectVariants or GenotypeGVCFs from google cloud storage requires a certain prefix to the genomicsdb name which is not in our documentation. It should be added, not sure where would be most helpful though. Maybe here?
In GATK4, the GenotypeGVCFs
tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs
. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with -ERC GVCF
or -ERC BP_RESOLUTION
.
Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport
and CombineGVCFs
.
GenomicsDBImport
is the preferred tool (see detailed instructions below); CombineGVCFs
is provided only as a backup solution for people who cannot use GenomicsDBImport
. We know CombineGVCFs
is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport
first and only fall back on CombineGVCFs
if you experience issues that we are unable to help you solve (ask us for help in the forum!).
Using GenomicsDBImport
in practice
The GenomicsDBImport
tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs
can then read from the created GenomicsDB directly and output the final multi-sample VCF.
So, if you have a trio of GVCFs your GenomicsDBImport
command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):
gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsdb-workspace-path my_database \ --intervals chr20,chr21
That generates a directory called my_database
containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033
-- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.
Then you run joint genotyping; note the gendb://
prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.
gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O test_output.vcf
And that's all there is to it.
Important limitations and Common “Gotchas”:
- At least one interval must be provided when using
GenomicsDBImport
. - Input GVCFs cannot contain multiple entries for a single genomic position.
GenomicsDBImport
cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.- The annotation counts specified in the header MUST BE VALID! If not, you may see an error like
A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV
with mention of a core dump (which may or may not be output depending on your system configuration.) You can check your annotation headers with vcf-validator from VCFtools - GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.
- If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable
TILEDB_DISABLE_FILE_LOCKING=1
before running any GenomicsDB tool. If you don’t, you will likely see an error likeCould not open array genomicsdb_array at workspace:[...]
- HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample call sets, MNPs can be inferred from the phase set (
PS
) tag in the FORMAT field. - There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues to see if a fix is in in progress.
If you can't use GenomicsDBImport
for whatever reason, fall back to CombineGVCFs
instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
Addendum: extracting GVCF data from the GenomicsDB
If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants
as follows:
gatk SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf
For a GenomicsDB workspace in a bucket, you need to use the gendb-gs://
prefix to access it.
Bells and Whistles
GenomicsDB now supports allele-specific annotations, which have become standard in our Broad exome production pipeline.
GenomicsDB can now import directly from a Google cloud path (i.e. gs://
) using NIO.
7 comments
The syntax in this is incorrect, a list of comma separated intervals will not work: `--intervals chr20,chr21`
Instead, it seems like a list file is required?: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
an error occurred when using joint genotyping, seems the -newQual is not a valid option for `GenotypeGVCFs`
error message:
A USER ERROR has occurred: -newQual is not a recognized option
can you append a gvcf to a previously created gendb database?
for example:
So that you dont have to rerun GenomicsDBImport on all samples each time you have new samples genotyped?
I tried this with the intention of combining gVCFs and using SelectVariants (as described in the addendum) with the --concordance flag to extract a set of variants from my gVCFs. However, the output had all missing data for the GT fields, and after some searching I found that SelectVariants does not output the GT field. Is there a way to extract GTs without using the genotyper? My gVCFs were produced by another caller.
Also, I could not get chr20,chr21 to work either.
No one actually looks at the comments. Lol. Useless.
I have combined some GVCFs by GenomicsDBImport, and continuity added other samples. however, I find some mistakes with previous samples, how do I delete these samples from GATK GenomicsDBImport?
The example is possibly misleading because --intervals don't seem to take a comma-separated list of arguments, at least with --intervals NC_001133.9,NC_001134.8 in GATK v4.2.6.1 this gives:
Maybe this is because of the formatting of the refseq ids,
OR the example should possibly be given as
gatk GenomicsDBImport \
-V data/gvcfs/mother.g.vcf \
-V data/gvcfs/father.g.vcf \
-V data/gvcfs/son.g.vcf \
--genomicsdb-workspace-path my_database \
--intervals chr20 --intervals chr21
Please sign in to leave a comment.