GVCF - Genomic Variant Call Format – GATK

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant discovery analyses.

Important notes

The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller in GATK versions 3.x and 4.x contain additional information that is formatted in a very specific way. Read on to find out more.

GVCF files produced by HaplotypeCaller from GATK versions 3.x and 4.x are not substantially different. While we don't recommend mixing versions, and we have not tested this ourselves, it should be okay to use GVCFs made by different versions if the annotations and the GVCFBlock definitions (see below) are the same.

General comparison of VCF vs. GVCF

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION GVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of GVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a GVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a GVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the GVCF header. The purpose of the blocks (also called banding) is to keep file size down, so we recommend using the -GVCF option over BP_RESOLUTION.

Example GVCF file

This is a banded GVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.2 VCF:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One FORMAT annotation is unique to the GVCF format:

##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">

This defines what was the minimum amount of coverage observed at any one site within a block of records.

The header goes on:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="[full command line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 4:04:34 PM EST">

At this point in the header we see the GVCFBlock definitions, which indicate the GQ ranges used for banding:

[individual blocks from 1 to 55]
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)

In recent versions of GATK, the banding strategy has been tuned to provide high resolution at lower values of GQ (59 and below) and more compression at high values (60 and above). Note that since GQ is capped at 99, records where the corresponding PL is greater than 99 are lumped into the 99-100 band.

After that, the header goes on:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=GRCh37>
##source=HaplotypeCaller

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10001567 and ends at 20:10001616.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10001567    .   A   <NON_REF>   .   .   END=10001616    GT:DP:GQ:MIN_DP:PL  0/0:38:99:34:0,101,1114
20  10001617    .   C   A,<NON_REF> 493.77  .   BaseQRankSum=1.632;ClippingRankSum=0.000;DP=38;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=136800.00;ReadPosRankSum=0.170    GT:AD:DP:GQ:PL:SB   0/1:19,19,0:38:99:522,0,480,578,538,1116:11,8,13,6
20  10001618    .   T   <NON_REF>   .   .   END=10001627    GT:DP:GQ:MIN_DP:PL  0/0:39:99:37:0,105,1575
20  10001628    .   G   A,<NON_REF> 1223.77 .   DP=37;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=133200.00   GT:AD:DP:GQ:PL:SB   1/1:0,37,0:37:99:1252,111,0,1252,111,1252:0,0,21,16
20  10001629    .   G   <NON_REF>   .   .   END=10001660    GT:DP:GQ:MIN_DP:PL  0/0:43:99:38:0,102,1219

20  10001661    .   T   C,<NON_REF> 1779.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1808,129,0,1808,129,1808:0,0,26,16

20  10001662    .   T   <NON_REF>   .   .   END=10001669    GT:DP:GQ:MIN_DP:PL  0/0:44:99:43:0,117,1755

20  10001670    .   T   G,<NON_REF> 1773.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1802,129,0,1802,129,1802:0,0,25,17

20  10001671    .   G   <NON_REF>   .   .   END=10001673    GT:DP:GQ:MIN_DP:PL  0/0:43:99:42:0,120,1800
20  10001674    .   A   <NON_REF>   .   .   END=10001674    GT:DP:GQ:MIN_DP:PL  0/0:42:96:42:0,96,1197
20  10001675    .   A   <NON_REF>   .   .   END=10001695    GT:DP:GQ:MIN_DP:PL  0/0:41:99:39:0,105,1575

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Compressing file sizes with reblocking

Reblocking refers to the process of compressing GVCFs in order to reduce file storage footprints. This can help with joint genotyping pipelines by removing alt alleles that do not appear in the called genotype.

This can be implemented using the ReblockGVCF (BETA) in GATK, or the WARP VariantCalling WDL script which is imported by both the Exome and Whole Genome workflows in WARP.

4 comments

shameem

September 30, 2020 15:21
Hello All,

I tried to call GTAK 'HaplotypeCaller gvcf mode' and it ran successfully. Output gvcf file showing the following header information

##fileformat=VCFv4.2
##ALT=ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT"

I checked with some other gvcf file in online, including GATK example file, and it showing the ALT header as

##fileformat=VCFv4.2
##ALT=ID=NON_REF,Description="Represents any possible alternative allele at this location".

Please can somebody tell me that is there any mistake in my output gcVF file or is it normal ?

GTAK command used for Haplotype call** : gatk-4.1.7.0/gatk HaplotypeCaller --java-options '-Xmx10G' -R
pdsk_genome.fa -I w_00001.bam -O wgs_00001.g.vcf -ERC GVCF --minimum-mapping-quality=20 --min-base-quality-score=20 --tmp-dir=tmp
1

Comment actions Permalink
Jacob Shujui Hsu

March 27, 2021 10:36
How to determine if there is a missing position? Is this the only case?
```
GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
```
0

Comment actions Permalink
Jeremy Leipzig

November 21, 2022 19:25
I'd like to see what an unknown (./.) block looks like
0

Comment actions Permalink
Amaru

March 09, 2023 15:34
Hi shameem,

Did you ever figure it put what is the difference between these two?

Thanks
0

Comment actions Permalink

Please sign in to leave a comment.

Genome Analysis Toolkit

Need Help?

Community Forum

Articles in this section

GVCF - Genomic Variant Call Format Follow

Important notes

General comparison of VCF vs. GVCF

The two types of GVCFs

Example GVCF file

Header:

Records

Compressing file sizes with reblocking

4 comments

Genome Analysis Toolkit

Need Help?

Community Forum

Articles in this section

Important notes

General comparison of VCF vs. GVCF

The two types of GVCFs

Example GVCF file

Header:

Records

Compressing file sizes with reblocking

Related articles