GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.
This document explains what that extra information is and how you can use it to empower your variant discovery analyses.
The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with
--output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller in GATK versions 3.x and 4.x contain additional information that is formatted in a very specific way. Read on to find out more.
GVCF files produced by HaplotypeCaller from GATK versions 3.x and 4.x are not substantially different. While we don't recommend mixing versions, and we have not tested this ourselves, it should be okay to use GVCFs made by different versions if the annotations and the GVCFBlock definitions (see below) are the same.
General comparison of VCF vs. GVCF
The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.
Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the
BP_RESOLUTION GVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.
The two types of GVCFs
As you can see in the figure above, there are two options you can use with
BP_RESOLUTION, you get a GVCF with an individual record at every site: either a variant record, or a non-variant record. With
GVCF, you get a GVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the
##GVCFBlock line of the GVCF header. The purpose of the blocks (also called banding) is to keep file size down, so we recommend using the
-GVCF option over
Example GVCF file
This is a banded GVCF produced by HaplotypeCaller with the
As you can see in the first line, the basic file format is a valid version 4.2 VCF:
##fileformat=VCFv4.2 ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
One FORMAT annotation is unique to the GVCF format:
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
This defines what was the minimum amount of coverage observed at any one site within a block of records.
The header goes on:
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> ##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="[full command line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 4:04:34 PM EST">
At this point in the header we see the GVCFBlock definitions, which indicate the GQ ranges used for banding:
[individual blocks from 1 to 55] ##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive) ##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive) ##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive) ##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive) ##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive) ##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive) ##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive) ##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive) ##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive) ##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)
In recent versions of GATK, the banding strategy has been tuned to provide high resolution at lower values of GQ (59 and below) and more compression at high values (60 and above). Note that since GQ is capped at 99, records where the corresponding PL is greater than 99 are lumped into the 99-100 band.
After that, the header goes on:
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> ##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity"> ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> ##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality"> ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> ##contig=<ID=20,length=63025520,assembly=GRCh37> ##source=HaplotypeCaller
The first thing you'll notice, hopefully, is the
<NON_REF> symbolic allele listed in every record's
ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.
The second thing to look for is the
END tag in the
INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10001567 and ends at 20:10001616.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 10001567 . A <NON_REF> . . END=10001616 GT:DP:GQ:MIN_DP:PL 0/0:38:99:34:0,101,1114 20 10001617 . C A,<NON_REF> 493.77 . BaseQRankSum=1.632;ClippingRankSum=0.000;DP=38;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=136800.00;ReadPosRankSum=0.170 GT:AD:DP:GQ:PL:SB 0/1:19,19,0:38:99:522,0,480,578,538,1116:11,8,13,6 20 10001618 . T <NON_REF> . . END=10001627 GT:DP:GQ:MIN_DP:PL 0/0:39:99:37:0,105,1575 20 10001628 . G A,<NON_REF> 1223.77 . DP=37;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=133200.00 GT:AD:DP:GQ:PL:SB 1/1:0,37,0:37:99:1252,111,0,1252,111,1252:0,0,21,16 20 10001629 . G <NON_REF> . . END=10001660 GT:DP:GQ:MIN_DP:PL 0/0:43:99:38:0,102,1219
20 10001661 . T C,<NON_REF> 1779.77 . DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,42,0:42:99:0|1:10001661_T_C:1808,129,0,1808,129,1808:0,0,26,16
20 10001662 . T <NON_REF> . . END=10001669 GT:DP:GQ:MIN_DP:PL 0/0:44:99:43:0,117,1755
20 10001670 . T G,<NON_REF> 1773.77 . DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,42,0:42:99:0|1:10001661_T_C:1802,129,0,1802,129,1802:0,0,25,17
20 10001671 . G <NON_REF> . . END=10001673 GT:DP:GQ:MIN_DP:PL 0/0:43:99:42:0,120,1800 20 10001674 . A <NON_REF> . . END=10001674 GT:DP:GQ:MIN_DP:PL 0/0:42:96:42:0,96,1197 20 10001675 . A <NON_REF> . . END=10001695 GT:DP:GQ:MIN_DP:PL 0/0:41:99:39:0,105,1575
Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
Compressing file sizes with reblocking
Reblocking refers to the process of compressing GVCFs in order to reduce file storage footprints. This can help with joint genotyping pipelines by removing alt alleles that do not appear in the called genotype.
This can be implemented using the
ReblockGVCF (BETA) in GATK, or the Warp VariantCalling WDL script which is imported by both the Exome and Whole Genome workflows in WARP.