This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.
Overview
This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.
In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF
mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).
In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.
This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".
Workflow details
This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.
1. Variant calling
Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option --emitRefConfidence GVCF
, and using the .g.vcf
extension for the output file.
Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000
to set the correct index strategy for the output gVCF.
2. Data aggregation step
A new tool called GenomicsDBImport is necessary to aggregate the GVCF files and feed in one GVCF to GenotypeGVCFs. You can read more about it here. You can also run CombineGVCFs if you are not able to use GenomicsDBImport.
3. Joint genotyping
Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
4. Variant recalibration
Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.
As always, we look forward to comments and observations from the research community!
4 comments
All of the links that I've come across for the Best Practices Workflow are 404'd.
(e.g. https://gatk.zendesk.com/hc/en-us/articles/360035894751)
It is not clear at what step do you add the information from a new batch of genotyped samples to the previously genotyped information to solve the N + 1 problem. Do I run GenomicsDBImport on the new "N + 1" batch and then run GenotypeGVCFs on all the GVCFS, new and old? Or the way that this workflow solves the problem is in the previous step, HaplotypeCaller?
If all (the original N and the new N+1) the gVCFs have to be run together with GenomicsDBImport and GenotypeGVCFs, the amount of resources and time it takes to run GenomicsDBImport to just add a handful of samples to a larger, previously called, sample set is high.
Running this workflow results in genotype calls defaulting to the reference Allele at sites without reads (DP of 0). Is there a flag to make those uncalled? As in a GT of ./.
Is this workflow applicable to RNAseq data?
Please sign in to leave a comment.