By Yossi Farjoun, Associate Director of computational research methods in the Data Sciences Platform
**A note to explain the context of the paper by Nauman Javed, Y.T. and others, “Detecting sample swaps in diverse NGS data types using linkage disequilibrium” (available as a preprint in bioRxiv). In this paper we used a flexible genotype comparison tool from the Picard toolkit — CrosscheckFingerprints — to compare thousands of ENCODE samples and found that about 3% were mislabeled. The CrosscheckFingerprints tool is part of a suite of Picard tools that all use the same underlying infrastructure to verify sample identity. They were originally developed around 2009 by Tim Fennell and have been used at the Broad Institute in the sequencing production facility ever since. Note that in this post, I describe the system that is used at the Broad Institute, not the paper mentioned above. The paper makes heavy use of this fingerprinting tooling and, to my knowledge, serves as the tool’s publication debut.**
One thing that keeps me up at night is the nightmarish scenario of a "sample swap".
Imagine you have a sample that is sequenced and analyzed as if it came from individual A, but in fact came from individual B. This can happen for many reasons — mislabeling of sample sheets, rotated sample plate, informatic error, to name a few — but the end result is the same: we're going to provide the wrong information to the research team. If the sample in question is part of a trio, you might end up mistakenly inferring that some variants in your callset arose through de novo mutations, when they simply came from someone else. Or if you're looking for somatic variants, your pipeline might end up calling contaminant germline variants as somatic. In a clinical setting this can be especially devastating, since it means that a patient might effectively be given someone else's test results. So it's very important to have strong safeguards in place to detect sample swaps.
There is also a second level at which this can occur: "readgroups", which are subsets of the sequencing data generated for each sample when it is sequenced through multiplexing, as we do at the Broad Institute. Since we generate the data for a sample's multiple readgroups separately and then aggregate them in the course of the analysis pipeline, we want to be sure that all the readgroups that are identified as belonging to a given sample actually came from that correct individual. Otherwise, an errant readgroup could end up contaminating the aggregated sample and cause lots of other headaches downstream similar to what I described above.
So, what can be done to prevent sample or readgroup swaps? The simple answer is that the genomes of different individuals contain different variants, so we can use that information to compare different genomic datasets and see if they look like they are coming from the same individual or not. However, to do this in practice and at a production-worthy scale, we have to answer several questions:
- How many variants should we use?
- Which variants should we use?
- How should we make the comparisons?
- What external (“orthogonal”) data would/should we use?
- What do the false-positive and false-negatives of the system look like?
- How does the algorithm tell us that it “doesn’t know” the answer?
When we originally set out to address this, we were looking for a solution that would be a one-stop shop. An assay we could run cheaply on all samples to provide orthogonal information, could be used at the readgroup, library or sample level; involving code that would run fast and work on shallow and deep data, and also work on exomes, genomes and RNA data. Oh, and we wanted it to not be biased towards any sub-population, e.g., we would not want to have better sensitivity for finding swaps in European samples than in African samples.
The existing tools didn’t satisfy all our needs, so we created our own.
First, using data from the 1000 Genome Project and HapMap, we identified ~300 common (~50% allele frequency) SNPs located on large LD blocks — regions in the genome that contain many SNPs that “travel” together from parent to child — that are shared across the major sub-populations. This would allow us to generate a "fingerprint" for every sample and for each of its readgroup based on the combination of genotypes present in the data.
Note that the approach of using LD blocks (rather than just looking at a set of individual sites) increases the genotyping power for shallow BAMs since it combines information from different SNPs in the same LD block, as shown in the figure below.
As an orthogonal assay, we designed a small (~100) SNP array that we run on most samples directly upon arrival at the lab. This way we can generate a fingerprint from each sample before it goes through any other processing.
We can check the concordance between the array-based and sequence-based fingerprints and from there, compare the likelihoods for the two major scenarios: a non-swap vs. a swap. To that end, we formulated a LOD score that expresses the Log ODds ratio between the two cases (see mathematical details here). A positive LOD score indicates a higher likelihood that the two samples come from the same individual, while a negative LOD score indicates that it’s more likely that they are from two different individuals. Note that the model assumes that the two different individuals are unrelated, so when you're comparing a parent to a child, the LOD score is smaller in size, but its sign is still negative as expected.
In practice, the LOD score calculation is implemented in Picard and exposed as two Picard tools (which are available in GATK4), CheckFingerprints and CrosscheckFingerprints. The former is only built for comparing a BAM/CRAM to a VCF, so we use it to compare the sample BAM to the SNP array VCF. The latter can take any number of BAMs and VCFs as input and perform an everything-by-everything comparison, which can be useful for making sure that all the read-groups agree with each other or for comparing two files together. We also use it later in the pipeline to check that the GVCFs that went into joint variant calling agree with the resulting VCF, to exclude the possibility that the program reordered the sample list or something unfortunate like that.
To validate the algorithm and implementation, we took many known samples and cross-checked them systematically against themselves and each other. If you'd like to run this analysis on your own data, the resource files we used are available for both hg19 and hg38..
When comparing readgroups from the same individual (labeled “same sample” in the figure), we see that, as expected, the LOD Score increases with the number of reads. In contrast, when comparing readgroups from different individuals (labeled “Different Sample”), the LOD Score is negative and decreases with the depth.
Careful observers may note that there is a set of RNA readgroups that got LOD=0 or even a positive LOD when compared to a different individual. Upon further examination, we found that the positive LODs corresponded to comparisons between different samples that originated from the same individual. This means that CrosscheckFingerprints correctly identified that even though we had marked these as being "different", the data revealed that they belonged to the same individual. Meanwhile, the zero-LOD RNA samples result from situations where there is no sequence coverage in either or both RNA samples at the sites used for fingerprinting. Since RNA coverage is extremely variable, this can only be resolved by looking at many more SNPs. If you're interested in learning more about this, the paper I mentioned earlier covers this in some detail and includes a different resource file that has over 18K SNPs. That resource has been shown to work when comparing different assays such as ChIP seq, exome, RNA, DNAse seq.
Finally, we wanted to see if we could compare different assays of the same individual, and found that, indeed, this is not only possible but very straightforward:
As you can see, most of the comparisons result in a positive LOD, and the very small blip over the zero indicates that a tiny fraction of the comparisons result in an inconclusive LOD.
I’m happy to say that at this point, we have very few actual sample-swaps that occur in our facility, and when they do occur, we are able to catch them thanks to this approach. However there are still “false positives”: we have seen cases where issues like contamination, “noise” (lots of mismatching bases), or large-scale Loss of Heterozygosity (LoH) in a high purity somatic sample can cause a negative LOD where there shouldn’t be one. To mitigate this problem, we also collect additional quality control metrics like alignment and contamination metrics, which enable us to weed out such cases.
Finally, you should know that the ecosystem of tools that pertain to fingerprinting is a bit larger than what I've described here, so I’ll leave you with a list of tools to explore further. Happy artifact hunting!
CalculateFingerprintMetrics --- A tool for evaluating if a fingerprint (i.e. genotypes from a BAM/VCF) look like they came from a single person.
IdentifyContaminant --- A tool to extract (from a contaminated sample) a fingerprint for the contamination. This requires an estimate of the level of contamination. It can be used to identify the contaminating sample.
ExtractFingerprint --- A tool for extracting the fingerprint information from reads into a small vcf which can then be used to access quickly to fingerprint against other samples.
ClusterFingerprintMetrics --- A tool for finding clusters of readgroups/libraries/samples/files that are connected via a positive LOD from the results of running CrosscheckFingerprints.