(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs – GATK

"Using a GenomicsDB with SelectVariants or GenotypeGVCFs from google cloud storage requires a certain prefix to the genomicsdb name which is not in our documentation. It should be added, not sure where would be most helpful though. Maybe here?

In GATK4, the GenotypeGVCFs tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).

Using `GenomicsDBImport` in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So, if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O test_output.vcf

And that's all there is to it.

Important limitations and Common “Gotchas”:

At least one interval must be provided when using GenomicsDBImport.
Input GVCFs cannot contain multiple entries for a single genomic position.
GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.
The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV with mention of a core dump (which may or may not be output depending on your system configuration.) You can check your annotation headers with vcf-validator from VCFtools
GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.
If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]
HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample call sets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.
There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues to see if a fix is in in progress.

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.

Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

For a GenomicsDB workspace in a bucket, you need to use the gendb-gs:// prefix to access it.

Bells and Whistles

GenomicsDB now supports allele-specific annotations, which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

7 comments

Robert Butler

April 01, 2020 00:03
The syntax in this is incorrect, a list of comma separated intervals will not work: `--intervals chr20,chr21`

Instead, it seems like a list file is required?: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
1

Comment actions Permalink
Junhao Su

April 08, 2021 02:48
an error occurred when using joint genotyping, seems the -newQual is not a valid option for `GenotypeGVCFs`

error message:

A USER ERROR has occurred: -newQual is not a recognized option
1

Comment actions Permalink
Jonathan Klonowski

October 04, 2021 18:04
can you append a gvcf to a previously created gendb database?

for example:
```
gatk GenomicsDBImport \
    -V gendb://my_database 
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21
```
So that you dont have to rerun GenomicsDBImport on all samples each time you have new samples genotyped?
1

Comment actions Permalink
Sandra Bohn

October 22, 2021 19:30

Edited
I tried this with the intention of combining gVCFs and using SelectVariants (as described in the addendum) with the --concordance flag to extract a set of variants from my gVCFs. However, the output had all missing data for the GT fields, and after some searching I found that SelectVariants does not output the GT field. Is there a way to extract GTs without using the genotyper? My gVCFs were produced by another caller.

Also, I could not get chr20,chr21 to work either.
1

Comment actions Permalink
Jonathan Klonowski

May 20, 2022 18:13
No one actually looks at the comments. Lol. Useless.
7

Comment actions Permalink
jingchun sun

November 30, 2022 02:34
I have combined some GVCFs by GenomicsDBImport, and continuity added other samples. however, I find some mistakes with previous samples, how do I delete these samples from GATK GenomicsDBImport?
1

Comment actions Permalink
Michael

March 10, 2023 10:53

Edited
gatk GenomicsDBImport \
-V data/gvcfs/mother.g.vcf \
-V data/gvcfs/father.g.vcf \
-V data/gvcfs/son.g.vcf \
--genomicsdb-workspace-path my_database \
--intervals chr20,chr21

The example is possibly misleading because --intervals don't seem to take a comma-separated list of arguments, at least with --intervals NC_001133.9,NC_001134.8 in GATK v4.2.6.1 this gives:

A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "NC_001133.9,NC_001134.8" is not valid for this input.

Maybe this is because of the formatting of the refseq ids,

OR the example should possibly be given as

gatk GenomicsDBImport \
-V data/gvcfs/mother.g.vcf \
-V data/gvcfs/father.g.vcf \
-V data/gvcfs/son.g.vcf \
--genomicsdb-workspace-path my_database \
--intervals chr20 --intervals chr21
0

Comment actions Permalink