How do people deal with batch effects?
Batch effects encompass all the ways that non-experimental factors sneak into your experimental data. Naturally, they're something we want to sterilize out of our research as much as possible, but it can be challenging to identify what is or is not a batch effect.
In our previous blog post, we discussed ways to compare pipeline outputs by running the same data through both pipelines. We designed an experiment comparing joint callsets of GVCFs from two widely used pipelines to try to get to the heart of the matter.
In this blog post, we are going to discuss what to look out for, and what you can (probably) safely ignore.
The Functional Equivalence standard is a great way to quantify pipeline differences, and it has very convenient thresholds for pass/fail determination, but it assumes that errors are randomly distributed across the genome and not correlated with reference context, common variation, or anything else.
Atkinson et al. gnomAD allele count comparison
A few years ago, some folks working with the gnomAD v2 data noticed a large number of variants for which the allele counts in the exomes and genomes were strikingly different. Elizabeth Atkinson set about finding these sites methodically with a chi-squared test. Since the publication of that paper, these discordant sites have been flagged on the gnomAD browser so that their data can be taken with a grain of salt.
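As a rough sketch of what such a test can look like (the function, inputs, and threshold below are illustrative assumptions, not the paper's code), each site's ALT and REF allele counts in the two datasets form a 2x2 contingency table:

```python
# Illustrative sketch of a per-site allele count comparison between two
# batches, in the spirit of the Atkinson et al. approach; the function name,
# inputs, and significance threshold are assumptions, not the paper's code.
from scipy.stats import chi2_contingency

def is_discordant(ac1, an1, ac2, an2, alpha=1e-6):
    """Chi-squared test on the 2x2 table of ALT vs REF allele counts."""
    table = [[ac1, an1 - ac1],   # batch 1: ALT alleles, REF alleles
             [ac2, an2 - ac2]]   # batch 2: ALT alleles, REF alleles
    _, p, _, _ = chi2_contingency(table)
    return p < alpha

# A site with similar allele frequencies in both batches should not be flagged.
print(is_discordant(ac1=120, an1=10_000, ac2=95, an2=8_000))  # False
```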
In the case of the gnomAD exome versus genome data, there are a number of potential sources of differences. Read length differences don't apply to our batch effect experiment here since we're starting with the same reads, and in our case we used the same aligner; note, though, that the Dragmap aligner that's part of the typical DRAGEN pipeline does differ from BWA in some ways.
Allele frequency concordance as measured by a chi-squared test was pretty good for both SNPs and indels, but it could be further improved by limiting SNPs to biallelic variants and by limiting indels to regions outside the low-complexity reference context intervals, as shown in the table below.
| Conditions | # significant hits | Total | Qscore |
| --- | --- | --- | --- |
| filtered SNPs | 856 | 2,156,069 | Q34 |
| filtered, biallelic SNPs | 68 | 2,057,036 | Q45 |
| filtered indels | 34 | 325,275 | Q27 |
| filtered, non-LCR indels | 61 | 168,468 | Q34 |
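For reference, the Qscore column is just a Phred-scaling of the fraction of tested sites that come up significant; here's a minimal sketch, assuming Qscore = -10 * log10(hits/total), which matches the numbers above:

```python
# Minimal sketch of the Phred-scaled concordance metric in the tables,
# assuming Qscore = -10 * log10(# significant hits / total sites tested).
import math

def qscore(hits, total):
    return -10 * math.log10(hits / total)

print(round(qscore(856, 2_156_069)))  # 34, matching the filtered-SNP row
print(round(qscore(68, 2_057_036)))   # 45
```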
TOPMed batch effect analysis
Much like CCDG, TOPMed combines large cohorts of data sequenced at a number of different sequencing centers. As described in the supplement of their 2021 paper, a subset of their samples was sequenced at two different centers, and they used PCA to show that the duplicate samples land in essentially the same place in PCA space. In our experiment, both callsets have the same outliers in the population PCA analysis.
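A minimal sketch of that kind of duplicate check, assuming a genotype dosage matrix per callset (function names and the workflow details are ours, not TOPMed's):

```python
# Illustrative sketch: project samples into PCA space and compare where a
# duplicate sample lands in each batch. Assumes a 0/1/2 dosage matrix.
import numpy as np
from sklearn.decomposition import PCA

def pca_coords(dosages, n_pcs=10):
    """dosages: (n_samples, n_variants) matrix; standardize per variant, then PCA."""
    X = dosages - dosages.mean(axis=0)
    sd = X.std(axis=0)
    X = X / np.where(sd > 0, sd, 1.0)  # avoid dividing by zero at monomorphic sites
    return PCA(n_components=n_pcs).fit_transform(X)

# A sample sequenced in both batches should land in essentially the same spot;
# a large distance between the pair of points is a red flag.
```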
Once populations have been identified (or in our case, referenced from the 1000 Genomes metadata), we can do a cross-batch case-control analysis to look for the types of batch effects that might show up in a GWAS. (A case-control analysis across all populations will mostly just pick up on imbalances in the population representation between cases and controls.)
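Here's a sketch of that idea, treating batch membership as the "phenotype" in a per-variant logistic regression; the function and inputs are illustrative, not TOPMed's code:

```python
# Sketch of a cross-batch case-control scan: treat the pipeline batch as the
# phenotype and test each variant for association with it.
import numpy as np
import statsmodels.api as sm

def batch_assoc_pvalues(dosages, batch):
    """dosages: (n_samples, n_variants) 0/1/2; batch: (n_samples,) 0/1 labels."""
    pvals = []
    for g in dosages.T:
        X = sm.add_constant(g.astype(float))
        fit = sm.Logit(batch, X).fit(disp=0)  # may not converge for rare alleles
        pvals.append(fit.pvalues[1])
    return np.array(pvals)

# Hits here can't be biology, since the "phenotype" is just the pipeline,
# so anything genome-wide significant is a candidate batch effect.
```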
Using only VQSR filters, we have 347 significant hits, corresponding to a Phred-scaled false positive rate of Q38. Adding further filters, like the "adj" genotype quality filter, Hardy-Weinberg equilibrium, callrate, and QD > 2 (for Functional Equivalence), brings that number down significantly. Restricting to variants with AC >= 20 doesn't make a big improvement, nor does excluding the "low complexity" regions.
| Filter (additive w/ above) | # significant hits | Total | Qscore |
| --- | --- | --- | --- |
| VQSR | 284 | 2,182,976 | Q39 |
| adj + HWE + callrate + QD | 54 | 2,059,989 | Q46 |
| AC >= 20 | 61 | 516,655 | Q40 |
| no LCR | 36 | 509,611 | Q41 |
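As one example of the filters in the table above, a Hardy-Weinberg equilibrium check can be as simple as a chi-squared test on genotype counts; this is a sketch (production pipelines often use an exact test instead):

```python
# Sketch of a simple Hardy-Weinberg equilibrium filter; real pipelines often
# use an exact test, but the chi-squared version shows the idea.
from scipy.stats import chisquare

def hwe_pvalue(n_hom_ref, n_het, n_hom_alt):
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)     # REF allele frequency
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    # ddof=1 because one parameter (the allele frequency) was estimated from the data
    return chisquare([n_hom_ref, n_het, n_hom_alt], f_exp=expected, ddof=1).pvalue

print(hwe_pvalue(6390, 3220, 390) > 1e-6)  # near equilibrium -> True
```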
Results are similar for the African genetic ancestry samples. Only six false positive variants were shared between the European and African case-control analyses, suggesting that a lack of reproducibility across populations can be a sign of a batch effect. When the labels were randomized, there were no significant hits for either the European or the African population samples.
All of Us-UKB combined analysis
Deflaux et al. published a paper using data from All of Us and UK Biobank to examine the differences between performing a GWAS on data pooled across the two cohorts and using a meta-analysis approach that combines the statistics for each callset after the fact. We didn't attempt to replicate their methods since our cohort size is many orders of magnitude smaller, but their procedure serves as an endorsement of several of the approaches above for mitigating batch effects. Similar to TOPMed, they eliminated rare variants by applying an AC filter. They also limited their analysis to variants that occurred in both cohorts and used the source cohort, or batch, as a covariate in their linear regressions.
Do we need to watch out for the Batch Effect Boogeyman?
He's out there, but don't lose sleep over it. There are things we can do to protect ourselves.
Keep track of known batches
One of the most powerful things you can do is keep track of which samples came from which pipeline. Case-control analyses like GWAS are the most likely to be affected by the kinds of batch effects that arise from different pipelines. Deflaux et al. and many others doing GWAS analyses across cohorts added the batch identifier for each sample as a covariate in order to limit false positive associations, as sketched below.
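In its simplest form, that just means handing the association model a batch indicator alongside the genotype; the sketch below uses made-up inputs and ordinary least squares, not Deflaux et al.'s actual code:

```python
# Sketch: include a batch indicator as a covariate so systematic dosage shifts
# between pipelines are absorbed by the batch term instead of the genotype term.
import numpy as np
import statsmodels.api as sm

def assoc_pvalue_with_batch(phenotype, genotype, batch):
    """phenotype: (n,) quantitative trait; genotype: (n,) 0/1/2; batch: (n,) 0/1."""
    X = sm.add_constant(np.column_stack([genotype, batch]))
    fit = sm.OLS(phenotype, X).fit()
    return fit.pvalues[1]  # p-value for the genotype term, batch-adjusted
```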
Similar to having different pipelines, sometimes projects have exome data from different captures. It’s not ideal, but there are certainly ways to deal with it. The gnomAD exomes comprise several different exome captures over many years. You can read about their extra QC steps in the corresponding blog post (https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/) and the paper methods.
Generate an exclude list
The concordance numbers from this experiment are better for the easy and "medium" regions of the genome, so excluding the hard regions (like those identified as low complexity by the DUST algorithm) will improve concordance.
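Applying an exclude list at analysis time can be as simple as an interval lookup. Here's a sketch assuming a merged, non-overlapping BED file of regions to drop (the function names are illustrative):

```python
# Sketch: filter variants against a BED exclude list (e.g., DUST low-complexity
# regions). Assumes intervals within a chromosome are merged and non-overlapping.
import bisect
from collections import defaultdict

def load_exclude(bed_path):
    """Read a BED file of regions to exclude, sorted per chromosome."""
    regions = defaultdict(list)
    with open(bed_path) as f:
        for line in f:
            chrom, start, end = line.split()[:3]
            regions[chrom].append((int(start), int(end)))
    for ivs in regions.values():
        ivs.sort()
    return regions

def is_excluded(regions, chrom, pos):
    """pos is 0-based; True if it falls inside an exclude interval on chrom."""
    ivs = regions.get(chrom, [])
    i = bisect.bisect_right(ivs, (pos, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos < ivs[i][1]
```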
With the introduction of the Dragmap aligner, be on the lookout for regions that are represented on both the canonical contigs and the alternate contigs. (You can get those via the *.alt file that goes with the hg38 reference – this gives the alignments of the alt contigs back to the canonical autosomes.)
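Since the *.alt file is SAM-formatted, pulling out the primary-assembly spans covered by alt-contig alignments is a short parse; the sketch below is our own construction:

```python
# Sketch: derive primary-assembly intervals covered by alt-contig alignments
# from the SAM-formatted *.alt file that ships with the hg38 reference.
import re

def alt_covered_spans(alt_path):
    with open(alt_path) as f:
        for line in f:
            if line.startswith("@"):
                continue  # skip SAM header lines
            fields = line.rstrip("\n").split("\t")
            rname, pos, cigar = fields[2], int(fields[3]), fields[5]
            if rname == "*" or cigar == "*":
                continue  # unmapped record
            # reference-consuming CIGAR ops: M, D, N, =, X
            span = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
                       if op in "MDN=X")
            yield rname, pos - 1, pos - 1 + span  # BED-style 0-based half-open
```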
More sophisticated exclude lists can be built using large cohorts of data. The list from the Atkinson et al. paper is included in their supplemental materials.
Don't trust variants not in LD
Luckily, we can also leverage some properties of population genetics to help filter out false positive association results due to batch effects. We know that true association hits should be in linkage disequilibrium with a handful of surrounding variants, while batch effects stemming from alignment tend to be fairly localized compared to megabase-scale haplotypes and will show few to no neighboring significant results in LD. In the supplement to the TOPMed paper, the authors state that "the lack of such LD partners, as observed here, is a classic sign of a spurious signal": for variants with AF > 5%, they expect with high probability that there will be other neighboring variants within 25kb with LD r2 > 0.8 and p < 0.0001. In short, association results that are all by their lonesome probably aren't worth following up on.
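Here's a sketch of that sanity check as we understand it (our own construction, not TOPMed's code): given a hit, look for neighbors within 25kb that are both in strong LD with it and at least nominally significant.

```python
# Sketch of the "LD partner" check described above; computes r^2 as the squared
# Pearson correlation of genotype dosages.
import numpy as np

def has_ld_partner(hit, positions, dosages, pvals,
                   window=25_000, r2_min=0.8, p_max=1e-4):
    """hit: variant index; positions: (m,); dosages: (n, m); pvals: (m,)."""
    near = np.flatnonzero(np.abs(positions - positions[hit]) <= window)
    for j in near:
        if j == hit or pvals[j] >= p_max:
            continue
        r = np.corrcoef(dosages[:, hit], dosages[:, j])[0, 1]
        if r * r >= r2_min:
            return True
    return False  # a lonely hit: likely spurious, per the TOPMed heuristic
```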
Be conservative about concordance
Deflaux et al. acknowledge the challenges of performing analyses that incorporate data from multiple sources. In their 2022 preprint, they compared GWAS results from two approaches: one combining all the data into a single callset and one performing a meta-analysis across GWAS results from each separate dataset. They chose to exclude from their pooled GWAS any biallelic variants that were not present in both batches they combined.
Conclusions
Batch effects are probably unavoidable in a lot of cases. The more sequencing data the field generates, the more inertia there is against reprocessing all of that data with the newest references, the newest tools, or the newest version of GATK. It's a huge cost and a huge computational effort, and it's only going to grow as the storage footprint of the data increases.
The largest sequencing projects like gnomAD (née ExAC) and TOPMed have combined data from different sources in their releases for years. Because of that fact, there's already a wealth of knowledge for dealing with this issue if you know where to look.
The Boogeyman is real, but he's nothing to be scared of; it just pays to know where he is when you're planning your next experiment. As long as we are cognizant of where batch effects can come into play, and do our best to acknowledge where they might be coloring our results, it's probably safe to sleep at night.
References
Atkinson et al. "Discordant calls across genotype discovery approaches elucidate variants with systematic errors." 2023. https://genome.cshlp.org/content/33/6/999
Taliun, Harris, Kessler, Carlson, Szpiech, Torres, Gagliano, Corvelo et al. "Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program." 2021. https://www.nature.com/articles/s41586-021-03205-y
Deflaux et al. "Cloud gazing: demonstrating paths for unlocking the value of cloud genomics through cross-cohort analysis." 2022. https://www.biorxiv.org/content/10.1101/2022.11.29.518423v1.full
A Jupyter notebook for the analysis contributing the results here and in part one is available on Terra at https://app.terra.bio/#workspaces/broad-firecloud-dsde-methods/joint-calling-FE-versus-dragen-batch-effect-exp