DNA sequencing continues to get cheaper and cheaper as time goes on, which means that we can sequence more samples for the same money. However, as cohorts get larger and larger, running any joint calling pipeline gets… challenging. At the Broad, we ran into this issue several years ago working with 20,000 samples for the Centers for Common Disease Genomics. We were able to power through, but then we got to the record-breaking ~150,000 sample gnomADv3 callset. At that point we had to make some big changes to the joint calling pipeline. Now that more of our GATK users are running into scaling issues themselves, it's time to take those changes out of the supplement and into the spotlight with the GATK "Biggest Practices".
Joint calling is the aggregate of several components: joint processing, joint discovery, and joint filtering, with the goal of what I'm going to call joint representation. In this blog post we'll look at all of those steps and what has changed for callsets with tens of thousands of samples.
The "single-sample" part of the pipeline scales very nicely since it's embarrassingly parallel, as they say in computer science. However, it's very important that all of the samples be processed in the same way, no matter where they were sequenced. (We talked about this at a single-sample level in the context of functional equivalence between DRAGEN and DRAGEN-GATK. We'll talk about assessing functional equivalence via joint calling in a future blog post.)
Recommendations for all of these steps are laid out in the Functional Equivalence specification by the Centers for Common Disease Genomics. Reaching consensus on those rules and guidelines was an arduous effort by many people across the five CCDG sequencing centers, but the result ensures that all the data produced is co-analyzable without the dreaded Reprocessing.
- First, decide on a reference. The reference should be the same for all your samples, right down to the patch version. DRAGEN-GATK and the CCDG FE standard use GRCh38, which has the added complexity of all the alternate contigs. Different aligners will have different ways of dealing with the duplicated and/or similar sequence surrounding the alternate contigs.
- We highly recommend using the same aligner for all samples. How different can two aligners be? We’ve seen that different choices about secondary alignment and alternate contigs can increase or decrease false positive SNP and indel calls. It’s easy to imagine how different Smith-Waterman alignments might put variants in different places or give them different representations within a complex region, though one would like to think that local assembly in a tool like HaplotypeCaller would make that irrelevant.
- Marking duplicates should be done consistently across samples as well. Specifically, it's important that the two pipelines being compared choose the representative duplicate in a set in the same way. Duplicate reads don't have to be identical – they just need the same start and mate start. That means that one could be riddled with mismatches, but we would typically prefer the read (pair) that has better base qualities and fewer mismatches.
- At this point we can compute a lot of informative metrics on our samples, including estimated cross-sample contamination and percentage of chimeric reads. Blacklist cutoffs for read and alignment quality metrics should be established for all samples. We continue to run recalibration of base qualities, though modern sequencers are doing a much better job producing well-calibrated quality scores.
- Finally comes single-sample variant calling to produce GVCFs. Aside from entirely new local assembly code, even minor changes and bug fixes from version to version could potentially yield slightly different alignments and/or representations of variants, so we advise using the same version of GATK HaplotypeCaller whenever possible.
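As a concrete illustration of the duplicate-marking point above, here's a toy sketch of deterministic representative selection, loosely modeled on the sum-of-base-qualities scoring that Picard MarkDuplicates uses by default. The quality floor and the read-pair representation here are simplifying assumptions, not the real record format:

```python
# Toy sketch of representative-duplicate selection (not production code).
# Pairs in a duplicate set share the same start and mate start, but may
# differ in base qualities; we deterministically keep the best-scoring pair.
MIN_QUAL = 15  # assumed quality floor; bases below it contribute nothing


def pair_score(read_pair):
    """Score a read pair by summing base qualities >= MIN_QUAL over both mates."""
    return sum(q for read in read_pair for q in read["quals"] if q >= MIN_QUAL)


def pick_representative(duplicate_set):
    """Keep the highest-scoring pair; the rest would be flagged as duplicates."""
    return max(duplicate_set, key=pair_score)


# Two pairs with the same coordinates but different base qualities:
dups = [
    ({"quals": [30, 30, 10]}, {"quals": [30, 30, 30]}),  # score 150
    ({"quals": [20, 14, 20]}, {"quals": [20, 20, 20]}),  # score 100
]
best = pick_representative(dups)
```

Because the score is a pure function of the reads, two pipelines running the same scoring rule on the same data will keep the same representative, which is exactly the consistency property we want.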
In the beginning, there were low coverage genomes. The diversity of the HapMap and Thousand Genomes Projects, as well as the sheer quantity of samples available, represented a vast treasure trove of genetic variation that we were proud to provide to researchers.
Nowadays we’d be lucky to entice anyone with the lackluster ~5X average coverage. (Incidentally, this is why new 30X 1000G genomes were recently generated.) That said, we were still able to discover a lot of global genetic variation from those low coverage samples, but how did we do that with so few reads?
Part of what helped is that we were going after common variation. From a population genetics standpoint, a variant that is seen in more than one sample is probably inherited from a common ancestor, so then the data from both of those samples support the same variant.
Since more variant reads in more samples means more evidence, our site-level variant quality was calculated in such a way that it kept increasing the more supporting reads we found. Once the quality exceeded the threshold set by the prior probability of seeing a variant at any given position by chance (about 1 in 1000 for SNPs and 1 in 8000 for indels in humans), we would output the variant.
With modern 30X data we can easily exceed the 1 in 1000 quality threshold with just the data from a single sample. The issue we run into is that because the QUAL score is effectively additive across samples, when evidence is weak because the variant is not real, we can still accumulate evidence until it passes the threshold. We see this in indel precision as we increase the number of samples in the cohort (bottom right panel):
This effect is dominated by low allele balance indel artifacts due to mapping error. These artifacts don't achieve high enough quality (and don't get output) until the number of samples increases. Using evidence across samples to discover variants may increase sensitivity when coverage drops below our expected targets, but comes at the cost of picking up more indel false positives in larger cohorts.
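The additive-QUAL failure mode can be sketched with a toy model. The numbers below are illustrative only, not GATK's actual likelihood calculation, but they show how many samples with weak evidence can eventually cross a fixed emission threshold:

```python
import math

# Toy model of an (approximately) additive site-level QUAL, showing how weak
# per-sample evidence accumulates across a cohort. Illustrative numbers only.

SNP_PRIOR = 1.0 / 1000  # rough chance of a SNP at any given position
QUAL_THRESHOLD = -10 * math.log10(SNP_PRIOR)  # = 30 on the Phred scale


def site_qual(per_sample_quals):
    """Cohort QUAL as a simple sum of per-sample contributions."""
    return sum(per_sample_quals)


weak = 2.0  # assumed Phred-scaled support from one low-allele-balance artifact
for n_samples in (1, 10, 100):
    q = site_qual([weak] * n_samples)
    print(n_samples, "samples -> QUAL", q, "emitted:", q > QUAL_THRESHOLD)
```

With one or ten samples the artifact stays below threshold, but at a hundred samples the summed evidence clears it, so the false positive gets emitted, which is the pattern in the indel precision panel above.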
Reblocking and the GnarlyGenotyper
As we increased the size of our joint calling cohorts from tens of thousands to hundreds of thousands we quickly hit a wall in terms of computational burden. There was too much data to copy, combining samples took too long, and calculating the site-level QUAL score used way too much time and memory.
We used a variety of approaches to get our costs and runtimes down:
- First, we compressed (in a lossy way) many of the adjacent reference blocks in the GVCFs with a new tool called ReblockGVCF. For gnomADv3 we did this in a very aggressive way with two bands of quality: pass (GQ >= 20) and fail (GQ < 20). Exomes especially benefit from this compression since the coverage and thus GQ tends to vary along a target.
- To reduce the number of alleles for which we had to combine data and output PLs, we use the same ReblockGVCF tool to remove any alternate alleles that aren't called or are called with a GQ less than 20. The GQ 20 threshold helps us deal with the bad variants accumulating evidence as described above. The reduction in alleles decreases the merge time, but also decreases the compute time for the QUAL score.
- To further speed things up, the GnarlyGenotyper (with the help of an output annotation added by ReblockGVCF) calculates an approximation for QUAL from the INFO field without iterating over the genotypes. This saves a lot of time calculating QUAL, which was only used as a minimum threshold and in the QD annotation.
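To make the reference-block banding concrete, here's a minimal sketch of two-band compression in the spirit of the gnomADv3 ReblockGVCF configuration described above. The block representation and band boundary are simplified assumptions; real GVCF reference blocks carry much more detail:

```python
# Toy two-band reference-block compression (not the real ReblockGVCF logic).
# Blocks are (start, end, GQ) tuples; adjacent blocks in the same GQ band
# are merged, which is lossy but greatly shrinks the GVCF.
GQ_BAND_FLOOR = 20  # assumed band boundary from the gnomADv3 configuration


def band(gq):
    return "pass" if gq >= GQ_BAND_FLOOR else "fail"


def reblock(ref_blocks):
    """Merge adjacent reference blocks that fall in the same GQ band."""
    merged = []
    for start, end, gq in ref_blocks:
        b = band(gq)
        if merged and merged[-1][2] == b and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, b)  # extend the previous block
        else:
            merged.append((start, end, b))
    return merged


# Four blocks collapse to three: the first two share the "pass" band.
blocks = [(100, 109, 45), (110, 119, 32), (120, 129, 12), (130, 139, 50)]
compressed = reblock(blocks)
```

The per-base GQ detail inside a band is lost, which is the trade-off the "Biggest Practices" accepts in exchange for tractable storage and merge times.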
We call these changes the "Biggest Practices" because there is some information being lost in comparison with the Best Practices, but without these optimizations, genomic data releases like gnomADv3, UKBiobank, All of Us, and many more would not have been possible.
In the "Biggest Practices" applied to modern, 30X PCR-free genomes we still do joint filtering. Because we combine data across samples to produce the statistics (the features, or annotations) we use for filtering, seeing a variant in more samples and combining the data across those samples gives us better statistical power. An illustrative toy example is given below.
The hypothesis underlying the VQSR model is that the annotation distributions for artifacts/errors have different means compared with the distributions for true variants. Specifically, we're looking at the distribution of sample means. As such, the variance of the sample mean will decrease with the number of observations. Theoretically, the more data we can get for a variant from more samples, the tighter its annotation distribution will be, which we have also seen empirically.
In the example above, singletons seen only in one sample have the widest distribution, which is difficult to distinguish from the distribution of errors, while common variants with lots of samples' worth of data have the tightest distribution and the lowest chance of false positives, based on the overlap of their distributions. Given this improvement of filtering accuracy with more observations, we definitely want to leverage that in our joint callset, so we do combine data across samples for filtering purposes.
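A quick simulation (toy numbers, not real annotation values) shows the sample-mean spread tightening as more samples contribute, following the familiar sd/sqrt(n) behavior:

```python
import random
import statistics

# Toy demonstration (not GATK code) that an annotation's sample mean gets
# tighter as more samples observe the variant: sd(mean) ~ sd / sqrt(n).
random.seed(42)

TRUE_MEAN, NOISE_SD = 15.0, 4.0  # assumed per-sample annotation distribution


def mean_of_n(n):
    """Annotation sample mean for a variant observed in n samples."""
    return statistics.fmean(random.gauss(TRUE_MEAN, NOISE_SD) for _ in range(n))


def spread(n, trials=2000):
    """Empirical spread of the sample mean across many simulated variants."""
    return statistics.stdev(mean_of_n(n) for _ in range(trials))


singleton_spread = spread(1)    # variant seen in one sample
common_spread = spread(100)     # variant seen in a hundred samples
```

The hundred-sample distribution is roughly ten times narrower than the singleton one, so it overlaps far less with the error distribution, which is why common variants are the easiest to filter accurately.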
From a technical standpoint, we don't need genotype-level data for this kind of filtering, so combining and using data across samples for filtering is a relatively low computational cost no matter how large our cohorts get.
VariantRecalibrator "scatter mode"
For the "Biggest Practices", we did have to make some relatively minor changes to the way we run VQSR. VQSR traditionally ran as a single task once all the data was merged, but with whole genome cohorts in the tens of thousands, that used way too much memory. Now we parallelize the VQSR steps so we can continue to scale filtering.
Regardless of which application you use for combining GVCFs — whether it be GATK's CombineGVCFs, GenomicsDB, Hail, or the experimental, BigQuery-based Genomic Variant Store (GVS) — you'll still be able to get information about every sample at every (variant) position, since you're starting with GVCFs. This is particularly important in getting accurate allele frequency estimates.
Is that variant of interest really rare (i.e. not present in most other samples) or are the other samples just lacking coverage at that position? This figure from our GATK workshop slides summarizes the difference quite nicely:
Provided one of the samples in the cohort has a variant at that position, we'll be able to distinguish between the top case as high confidence homozygous reference and the bottom case as no-call or GQ0 in the Biggest Practices. This yields more accurate estimates of allele frequency and helps us interpret variants in the context of the whole population. Joint representation also has the benefit of left-aligning indels and outputting all variants that start at the same position in the same VCF line, though the former is also the responsibility of joint processing.
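A minimal sketch of why that distinction matters for allele frequency, assuming a simplified diploid genotype representation (tuples of allele indices, with None marking a no-call):

```python
# Toy allele-frequency calculation (simplified; real VCFs encode much more).
# Confident hom-ref samples add to the AN denominator; no-call samples
# (no coverage, or GQ0) are excluded from it entirely.


def allele_frequency(genotypes):
    """AC/AN over diploid genotypes; None marks a no-call sample."""
    called = [gt for gt in genotypes if gt is not None]
    an = 2 * len(called)                    # total called alleles
    ac = sum(gt.count(1) for gt in called)  # alternate allele count
    return ac / an if an else float("nan")


# One het carrier among four samples, under the two scenarios in the figure:
with_ref_blocks = [(0, 1), (0, 0), (0, 0), (0, 0)]  # others confidently hom-ref
without_coverage = [(0, 1), None, None, None]        # others have no data
```

With reference-block information the variant looks rare (AF 0.125); without it, the same variant appears common (AF 0.5) simply because the denominator shrank, which is exactly the ambiguity the figure illustrates.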
The process of "joint calling" is actually many joint steps referred to together. Data that's analyzed together should always be processed with the same single-sample pipeline whenever possible. We used to use data across samples to maximize our sensitivity, but with modern ~30X samples this mostly just results in accumulating false positives in larger cohorts.
The inclusion of the new ReblockGVCF and GnarlyGenotyper tools in the "Biggest Practices" mitigates that false positive increase while allowing the joint calling pipeline to scale to hundreds of thousands of samples. Meanwhile, filtering continues to leverage data from multiple samples for better results.
We've always recommended that cohorts for joint calling include at least 50 samples, and that hasn't changed. But the next time you find yourself heaving under the weight of 10,000 samples, try reaching for the GATK Biggest Practices with ReblockGVCF and GnarlyGenotyper.
In fact, our own production pipelines make the switch to the Biggest Practices at 2,000 samples, which is what we'd recommend for the best cost efficiency.
That said, as we make the switch to our Genomic Variant Store, we'll be using the Biggest Practices for all callsets.