Better together
For germline short variants (SNPs and indels), we recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That workflow includes a joint analysis step that empowers variant discovery by providing the ability to leverage population-wide information from a cohort of multiple sample, allowing us to detect variants with great sensitivity and genotype samples as accurately as possible. Our workflow recommendations provide a way to do this in a way that is scalable and allows incremental processing of the sequencing data.
The key point is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that allows us to decouple the initial identification of potential variant sites (ie variant calling) from the genotyping step, which is the only part that really needs to be done jointly. This is the so-called "GVCF workflow", which utilizes a GVCF intermediate to allow scaling joint calling efficiently and conveniently.
As of GATK 3.0, you can use the HaplotypeCaller to call variants individually per-sample in -ERC GVCF
mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint calling, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.
Previously established cohort analysis strategies
Until recently, three strategies were available for variant discovery in multiple samples:
- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- classic joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.
The best of these, from the point of view of variant discovery, was classic joint calling, because it provided the following benefits:
1. Clearer distinction between homozygous reference sites and sites with missing data
Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.
2. Greater sensitivity for low-frequency variants
By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- so that singletons themselves have less chance of happening in the first place.
3. Greater ability to filter out false positives
The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.
Figure 1: Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.
Drawbacks of traditional joint calling (all steps performed multi-sample)
There are two major problems with the joint calling strategy.
- Scaling & infrastructure
Joint calling scales very badly -- the calculations involved in variant calling (especially by methods like the HaplotypeCaller’s) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at Broad where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.
- The N+1 problem
When you’re getting a large number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (ie. because the analysis is time sensitive). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately the joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.
10 comments
Hello,
Thank you for the explanation. I had a question about what type of samples can I include in a cohort-type analysis.
Assuming that they are all WES (same panel) and Same ancestry (Just sequenced in separate batches)
If my cohort has 100 cases with 8 families and sporadic then should I run them all together or divide them into batches of families and one for sporadic?
The other question I have is: If I have 100 cases and 40 controls can I do the 'join genotyping' together or do it separately?
Thank you for your help.
Hello,
Could anyone explain the meaning of "rescueing genotype calls" under section titled "2. Greater sensitivity for low-frequency variants"? Does it mean the genotype calls of the low-frequency variants would be more accurate? Or does it mean that the low frequency variant site would be detected more easily?
Thanks for your help.
This is such a great explanation
Best practices link results in a 404 error
https://gatk.broadinstitute.org/hc/en-us/articles/360035890431/comments/360013216472
I have the same issue with 404 for the Best Practices link
I had the same query for joint call cohort. I assume that the diseased and case control samples should not be merged for cohort calling thinking that the high confidence SNP in case control samples might influence the variant calling procedure.
I have a thousand whole-genome sequencing VCF files. Is there any way to figure out and be sure if these VCF files are individually called or jointly called? Is there any line in the header to look at for this?
@Saeed GATK will leave the commands used to call the genotypes in the header, so you can infer from there whether your vcfs we're joint-called or not.
@Degang I have this line in the header for GenotypeGVCFs:
But I am not sure what input has been provided for GenotypeGVCFs ( 1. a single single-sample GVCF 2. a single multi-sample GVCF created by CombineGVCFs or 3. a GenomicsDB workspace created by GenomicsDBImport). Is there any specific commandline for this?
As mentioned in replies above sent in 2021, the link to best practices still seems to generate a 404 error.
Please sign in to leave a comment.