What happens if you don’t joint call all your samples together? Why do people spend so much time and money reprocessing all the samples they can get their hands on through the same joint calling pipeline?
This is the time when, invariably, one of the more senior bioinformaticians in attendance shakes their head and solemnly utters, “batch effects”. Typically this is a conversation stopper. Who would advocate in favor of batch effects? Not you, surely! So it’s decided. We’ll reprocess all the samples.
But what if we didn't have to?
How can we detect batch effects?
To explore the differences between two variant calling pipelines, we designed a data experiment of sorts. (We'll get to the mixed-callset analysis in the follow-up to this blog post.) The analysis uses public genomic data to compare two widely used pipelines in the context of common types of analyses.
Setting up the experiment
The two single-sample pipelines to be compared are the Functional Equivalence (FE) pipeline (as described in the FE paper) and the DRAGEN-GATK variant calling pipeline (WholeGenomeGermlineSingleSample.wdl, equivalent to DRAGEN 3.4.12). Both pipelines were run from BWA-aligned reads and output "reblocked" GVCFs. GVCFs from each pipeline were joint called with GenomicsDBImport, GenotypeGVCFs, and VQSR. The differences between the two pipelines were limited to variant calling, contrasting the GATK 3.5 HaplotypeCaller with the DRAGEN-equivalent version, which includes an STR-specific indel model. (Read more details in our DRAGEN-GATK Update blog post.)
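For orientation, the joint-calling steps shared by both pipelines can be sketched as GATK command lines built in Python. This is an illustrative sketch only, not the actual workflow code: the reference path, workspace name, and interval are placeholders, and VQSR is omitted.

```python
# Sketch of the shared joint-calling steps (GenomicsDBImport -> GenotypeGVCFs).
# All file paths here are placeholders, not the real workflow inputs.

def joint_calling_commands(gvcfs, workspace, interval, output_vcf):
    """Build the command sequence to joint call a set of (reblocked)
    GVCFs over a single interval."""
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace,
                  "-L", interval]
    for g in gvcfs:
        import_cmd += ["-V", g]
    genotype_cmd = ["gatk", "GenotypeGVCFs",
                    "-R", "ref.fasta",            # placeholder reference
                    "-V", f"gendb://{workspace}",  # read from the GenomicsDB workspace
                    "-O", output_vcf]
    return [import_cmd, genotype_cmd]
```

In the experiment described here, the same set of per-sample GVCFs from each pipeline would be run through these steps independently, so any downstream differences are attributable to the variant caller alone.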
Read data was taken from the NHGRI AnVIL high coverage 1000 Genomes release. The samples were subset to mother-father-offspring trios such that no sample occurred in more than one trio. This resulted in 601 trios. Results shown here are from chromosome 18.
One of the best ways to compare tools or pipelines is to use "well characterized samples", like the NIST Genome in a Bottle samples, including NA12878 shown here. The GIAB samples have been analyzed using a variety of tools and sequencing technologies, yielding highly confident variant calls as well as highly confident reference regions that help us estimate precision in variant calling. Similar comparisons could be conducted with any sample that's been sequenced and analyzed before, but take those comparisons with a small grain of salt.
DRAGEN is (slightly) more sensitive to NA12878 variants
Comparing our FE pipeline with DRAGEN, the DRAGEN results have slightly superior sensitivity and precision. The default VQSR filtering is a little too aggressive in both cases, as we can see from the big discrepancy between filtered and unfiltered recall (sensitivity).
Digging into the sensitivity differences between FE and DRAGEN, we found that some of the false negatives in the FE callset are in regions with low mappability, because HaplotypeCaller doesn't use reads with MQ < 20. There may also be a few instances where DRAGEN's "pileup caller" rescues assembly failures.
Most variants are shared between pipelines
We used the callsets from the two pipelines to compare allele concordance (without being strict about genotype concordance – an allele present at the same position in both callsets counts as a match).
Here "all filters" means:
- VQSR PASS
- probability of excess heterozygosity by chance > 1e-6
- "adj" genotype filtering – DP >= 10, GQ >= 20, and for het genotypes, allele fraction > 0.2
- AC > 0 (post-adj)
- call rate > 80%
- QD > 2 (for GATK).
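Taken together, the criteria above can be sketched in Python. This is a toy illustration rather than the notebook's actual code: the dict fields (e.g. "excess_het_p", "alt_fraction") are hypothetical stand-ins for the corresponding VCF annotations.

```python
# Hedged sketch of the genotype- and site-level filters listed above.
# Records are plain dicts with illustrative field names, not a real VCF API.

def passes_adj(gt):
    """Genotype-level 'adj' filter: DP >= 10, GQ >= 20, and for hets,
    alt allele fraction > 0.2."""
    if gt["DP"] < 10 or gt["GQ"] < 20:
        return False
    if gt["is_het"] and gt["alt_fraction"] <= 0.2:
        return False
    return True

def passes_site(site, genotypes):
    """Site-level filters: VQSR PASS, excess-het probability > 1e-6,
    QD > 2, call rate > 80%, and AC > 0 after the adj genotype filter."""
    if site["filter"] != "PASS":
        return False
    if site["excess_het_p"] <= 1e-6 or site["QD"] <= 2:
        return False
    called = [g for g in genotypes if g["called"]]
    if len(called) / len(genotypes) <= 0.8:
        return False
    adj = [g for g in called if passes_adj(g)]
    ac = sum(g["n_alt"] for g in adj)  # allele count recomputed post-adj
    return ac > 0
```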
Precision here is given with respect to the other callset. For example, "all-FE 0.98 precision" means that 98% of the SNPs in the (filtered) all-FE callset were also seen in the (filtered) DRAGEN callset. The reverse comparison also happens to yield 98% in this case.
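Cross-callset precision as defined here reduces to simple set arithmetic over variant keys (e.g. (chrom, pos, alt) tuples). A minimal sketch:

```python
def cross_precision(callset_a, callset_b):
    """Fraction of variant keys in callset A that also appear in
    callset B. The reverse comparison is computed by swapping the
    arguments."""
    a, b = set(callset_a), set(callset_b)
    return len(a & b) / len(a)
```

With the filtered all-FE keys as the first argument and the filtered all-DRAGEN keys as the second, this gives the "all-FE precision" described above; swapping the arguments gives the DRAGEN-relative figure.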
As is so often the case, the results for indels aren't quite as pleasing as for SNPs, but still reasonable. Filters here were tuned slightly to account for the overfiltering we saw above:
- VQSLOD > -13.9404 (99.9% sensitivity to common variants)
- call_rate > 80%
- probability of excess heterozygosity by chance > 1e-6
- QD > 2 (for FE)
Without filters the DRAGEN precision improves, indicating that default VQSR is probably overfiltering the FE callset. (Note the significantly higher total variant count for unfiltered.)
To see if these differences are likely to make an impact on downstream analysis, we turn to a previous publication comparing bioinformatic variability with the variance inherent to library creation and sequencing.
Functional Equivalence paper comparison
Part of the Centers for Common Disease Genomics project included work to determine the effects of analyzing data generated and processed at different sequencing centers. The authors of the 2018 paper (https://www.nature.com/articles/s41467-018-06159-4) deemed pipelines Functionally Equivalent if the genomic variants called from the same reads, processed at two different sequencing centers by their respective pipelines, were more concordant than two different biological samples from the same individual (biological replicates) processed with the same pipeline. They were able to reach a consensus on nearly all the tools and parameters in their read alignment and recalibration pipelines, such that the results were in better agreement than if the library prep, sequencing, and variant calling had been run twice. Even with 30X-coverage PCR-free samples, there was a surprising amount of variability between biological replicates.
Differences between pipelines are less than between duplicates
By the same logic, the FE pipeline is Functionally Equivalent to the (BWA-)DRAGEN pipeline if the variant discordance between them is lower than between two biological replicates. As the table shows, the FE and DRAGEN pipelines were more discordant than FE pipelines from two different centers, but less discordant than biological replicates. This indicates that data from the two pipelines are suitable to be analyzed together.
| | SNV discordance | Indel discordance |
| --- | --- | --- |
| FE vs. DRAGEN | 4.2 +/- 0.3% | 4.6 +/- 0.2% |
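One common way to express per-sample discordance between two callsets is the fraction of the union of variant sites seen in only one of the two; a minimal sketch, with the caveat that the FE paper's exact formula may differ in detail:

```python
def discordance(variants_a, variants_b):
    """Jaccard-style discordance between two per-sample variant sets:
    the fraction of the union that appears in only one callset.
    One common definition; not necessarily the FE paper's exact formula."""
    a, b = set(variants_a), set(variants_b)
    union = a | b
    # symmetric difference = sites private to one callset
    return len(a ^ b) / len(union) if union else 0.0
```

The per-comparison numbers in the table would then be the mean and standard deviation of this quantity across samples.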
Callset variant (site) concordance is good with low variance
In the Functional Equivalence experiment, 100-sample SNP and indel callsets were generated using GATK and compared between centers. Not accounting for alleles or genotypes, variants were considered concordant if a sample exhibited a variant at that position in both callsets. Genomic territory was broken up into regions by difficulty of variant calling using data like concordance of different variant calling tools on the NA12878 sample, centromeres, segmental duplications, and common CNVs (see https://www.nature.com/articles/s41467-018-06159-4#Sec7 for details). We did similar comparisons between our two pipelines, stratified by genomic territory.
These results compare nicely with the boxplots in the FE paper figure 2a:
The variance for the data generated here is significantly lower, but the boxplot means (across samples) are very similar to the data from the FE paper. Our variance is likely lower because we used the same aligned CRAMs as input to variant calling in each pipeline.
Mendel error counts per sample are close between pipelines and lower than in the FE paper
The Functional Equivalence paper stratified Mendel error (ME) counts by concordant and discordant variants: variants that occurred at all in both callsets were considered concordant, while variants that occurred in one callset but not the other were considered discordant. MEs were counted only over informative sites, i.e. sites where the parent genotypes were some combination other than two hets, since two het parents can produce any offspring genotype. For SNVs, the FE paper's rate of MEs per informative site was about 0.3% for concordant variants and 22-24% at discordant sites. For indels, the rates were even higher, with concordant indels showing about 15% MEs over all genomic territory and discordant ME rates up to 50% (see their figure 2, panel b).

In our data, we found 0.07% to 0.1% for discordant SNVs for both FE and DRAGEN, with concordant ME rates as low as 7e-4%. Indel ME rates were slightly higher, reaching a maximum of 0.09% for DRAGEN indels over hard regions, which is about 80% lower than in the FE paper. We've done our best to replicate the methods described in the FE paper, but the orders-of-magnitude discrepancy is a slight concern. One significant difference is that the target coverage for the FE paper was 20X, while the NYGC genomes aimed for at least 30X; the NYGC 1000G samples processed here and the FE samples are all PCR-free. However, the good agreement between the ME rates for the all-FE and all-DRAGEN callsets generated here gives us confidence in the compatibility of our two tested pipelines.
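The Mendel error accounting described above can be sketched as follows. Genotypes are encoded as alt-allele counts (0, 1, 2); this is an illustrative reimplementation of the stated definitions, not the notebook's actual code.

```python
from itertools import product

def allele_options(gt):
    """Alleles (0 = ref, 1 = alt) a parent with this genotype can transmit."""
    return {0: (0,), 1: (0, 1), 2: (1,)}[gt]

def is_informative(mother, father):
    """A trio site is informative unless both parents are het,
    since two het parents can produce any offspring genotype."""
    return not (mother == 1 and father == 1)

def is_mendel_error(mother, father, child):
    """True if the child's genotype cannot be formed from one allele
    of each parent."""
    possible = {m + f for m, f in product(allele_options(mother),
                                          allele_options(father))}
    return child not in possible

def mendel_error_rate(trio_sites):
    """ME rate over informative sites; trio_sites is an iterable of
    (mother, father, child) genotype triples, one per site."""
    informative = [(m, f, c) for m, f, c in trio_sites if is_informative(m, f)]
    if not informative:
        return 0.0
    errors = sum(is_mendel_error(m, f, c) for m, f, c in informative)
    return errors / len(informative)
```

For example, a hom-ref x hom-ref pair producing a het child is counted as an error, while a double-het site is excluded from the denominator entirely.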
FE and GATK-DRAGEN 3.4.12 pipelines show good concordance
We talked before about why it's beneficial to joint call larger cohorts of samples: improved sensitivity to rare variants and standardized variant representations (poster and Best Practices links).
The Regier et al. Functional Equivalence paper gives us a more quantitative framework for evaluating differences, via comparison to results from the same samples sequenced twice. Compared to the variance of those biological replicates, the differences seen here between the FE and GATK-DRAGEN pipelines are acceptable for applications such as variant discovery. Adding more samples from the new pipeline to your cohort of interest, for example, won't harm your ability to find and confirm variants. However, that hasn't been the case for all the pipelines we've compared. Looking at our classic GATK single-sample pipeline on the NovaSeq 6000 versus DRAGEN 3.7.8 on the NovaSeq X, we see much more discordance between the pipelines than between biological replicates, for both SNPs and indels.
| | SNV discordance | Indel discordance |
| --- | --- | --- |
| NV6000 FE vs NVX DRAGEN 3.7.8 | 11.2 +/- 0.5% | 39.0 +/- 1.6% |
In that case, there will be false positive variants in NV6000 FE samples that never show up again in NVX DRAGEN and NVX DRAGEN variants that were missed before in NV6000 FE data. We declared the NovaSeqX DRAGEN 3.7.8 pipeline to be incompatible with the NovaSeq6000 FE pipeline, but believe that DRAGEN 3.7.8 is an improvement. Here we did recommend reprocessing of the old samples so that all data is derived from DRAGEN 3.7.8.
The treatment of concordance in this analysis assumes that errors are randomly distributed, or at least random within each genomic region. This is a good model for projects surveying population variation or assessing the frequency of disease variants. However, case-control studies are primed to pick up on any differences between the case samples and the control set. In our next blog installment, we'll look at a few more similar publications that highlight methods for detecting specific variants exhibiting batch effects.
Regier et al. "Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects" 2018 https://www.nature.com/articles/s41467-018-06159-4
A Jupyter notebook for the analysis contributing the results here and in part two is available on Terra at https://app.terra.bio/#workspaces/broad-firecloud-dsde-methods/joint-calling-FE-versus-dragen-batch-effect-exp