By Takuto Sato, Computational Methods Team, Data Sciences Platform
In the 18.104.22.168 release of GATK, we added a new tool, DownsampleByDuplicateSet. This tool randomly drops a fixed percentage of reads in a SAM file. The new feature it offers is that it handles reads that share the Unique Molecular Identifiers (UMIs) as a single unit. This enables a new kind of downsampling of reads—downsampling to simulate not only less sequencing but less library complexity. Before delving into the details, let’s first review the existing tools that GATK/Picard offers for downsampling reads. We hope that this will help to illuminate the kinds of operations and experiments the new tool allows us to do.
Downsample to simulate sequencing less
Suppose we sequence the whole genome of a human sample to an average depth of 30x. We then align the reads, mark duplicates, collect metrics, call variants. We achieve 99% SNP sensitivity with 95% precision. Since sequencing is not cheap, it is natural at this point to ask ourselves: for our next experiment, could we sequence a little less—perhaps to 25x instead of 30x—and still get similar results?
We often use downsampling to simulate such a scenario. Downsampling refers to the process of randomly discarding a specified fraction of reads from a SAM file. In our case, we might drop approximately 5/30 ~ 17% of the reads from our 30x SAM file so that the remaining SAM file has the coverage of about 25x. This is a pretty good simulation of stopping sequencing short when we reach 25x coverage, since we have no reason to believe that the last 5x we sequenced in going from 25x to 30x is any different from the rest.
Dropping reads, however, is not as simple as discarding lines in a SAM file at random using unix commands. This is because, in the paired short-read sequencing technology that is prevalent today in bulk DNA-seq and RNA-seq, reads come in pairs. By dropping reads line by line, we risk dropping a read and keeping its pair (also known as mate), and vice versa, not to mention supplementary and secondary alignments. We must avoid this in order for the simulation to reflect reality and for the resulting SAM file to remain valid.
The Picard tool DownsampleSam (and samtools view, for that matter) handles this problem by using a hash function that takes the read name (also known as the query name) string and maps it uniformly to a real number between 0 and 1. Because a read and its mate have the same read name, both reads map to the same number. So if we keep reads that map to a number less than 0.9, say, we drop roughly 10% of the original reads at random, and reads and their mates are either kept or dropped as a single unit.
Now, consider how DownsampleSam affects the duplication rate of the resulting SAM. Suppose that during the course of downsampling, each read pair is retained with probability p. Then a read pair and its duplicate are retained together with the probability p^2 < p. Thus on average the duplication rate will drop after downsampling. While a reasonable outcome for PCR duplicates¹, it is undesirable for optical duplicates, which tends to remain roughly constant across sequencing depths.
The Picard tool PositionBasedDownsampleSam was designed to handle this issue. Rather than mapping the read name to a probability, the tool keeps reads whose flowcell positions fall inside an ellipse that covers a specified fraction p of the flowcell tile.²
For instance, if p = 0.9, we draw an ellipse covering 90% of the flowcell tile. About 90% of all reads should fall inside the ellipse and therefore will be kept. This way, since the read and its mate are sequenced on the same position on the tile, they are either kept or discarded together. Furthermore, optical duplicates are also kept as a single unit with probability p rather than p^2, as we keep read pairs *and* their neighbors on the flowcell (up to the edge effect) assuming that optical duplicates occur uniformly across the tile.
Both DownsampleSam and PositionBasedDownsampleSam simulate sequencing less. With a downsampled SAM file we may estimate the loss in performance from losing a specified percent of reads as discussed above. We may also downsample one SAM to the coverage of another SAM to compare them fairly. Another application is a bioinformatic (in silico) mixture of DNAs from two individuals at specified fractions (e.g. 5% of person A mixed with 95% of person B). Such a mixture, for instance, may be used to evaluate a contamination estimate tool like CalculateContamination in the GATK family.
A different kind of downsampling — less starting material
In some instances, sequencing less may not be the desired simulation. For instance, consider the case of targeted sequencing of gene panels that is often done in a liquid biopsy assay. Many rounds of PCR are required to amplify a trace amount of circulating-tumor DNA, and we often sequence such a sample to an ultra-high depth (~20,000x) in order to boost (by consensus calling) the base qualities of the reads that contain the tumor allele. It is common in such a protocol to use unique molecular identifiers (UMIs) for more accurate duplicate marking.
Because we sequence a relatively small amount of starting material to such a high depth, we may sequence PCR-duplicates of many DNA molecules multiple times. We use the term family size to refer to the number of read pairs that share the same UMI (and thus originated from the same DNA molecule). The distribution of the family sizes for a particular sequencing run might look something like the rightmost panel in Figure 1.
Figure 1. The effect of sequencing less on family size distributions. The number at the top of each panel indicates the downsampling fraction relative to the original.
Now, suppose we have designed a new gene panel, say 10% in size of the standard panel. Assuming uniform bait efficiency, we expect the hybrid selection to capture roughly 10% of the DNA relative to the old panel. Thus we expect about 10% of sequencing relative to the original should be sufficient to achieve a similar depth for our smaller panel. How might we simulate this? We would certainly need some form of downsampling, but, as it turns out, DownsampleSam is not the appropriate tool for this particular simulation.
To see why, let us return to Figure 1. The rightmost plot shows the family size distribution using the original gene panel. To its left are the family size distributions after reducing the reads using DownsampleSam by fractions shown at the top of each distribution. We see that, as we sequence less, the weight of the distributions shifts to the left—that is, a randomly sampled UMI family has fewer and fewer reads in it. This reflects the fact that, assuming a fixed library complexity, we encounter less duplicates as we sequence less.
Why is this the case? Imagine you go fishing at a pond, and each time you catch a fish you mark and release it. If you fish for a whole month, you might catch some fish more than once (they have your mark on them). If you only have a weekend to fish, however, it's unlikely that you catch the same fish twice. In other words, the shorter the time you fish, less likely you are to see the same fish. All this depends on the size of the pond, too; after all, you may never see the same fish twice in the vast Atlantic Ocean, whether you fish for a weekend or for a month.
The same principle applies to sequencing and duplicates; the less you sequence, the less likely you are to encounter the same UMI (i.e. fish) twice.
What the simulation with DownsampleSam fails to capture is the fact that with our new panel the library complexity shrinks by 10%; we are now fishing at a much smaller pond, where we might expect to catch the same fish multiple times even over the course of a weekend. All this is to say that, if we reduce the amount of sequencing in proportion to the reduction in DNA—pre-amplification DNA—then we expect the family size distribution to remain roughly the same. We must simulate not only less sequencing but less library complexity also.
This issue came up when we attempted to create an in silico mixture of two healthy liquid biopsy (i.e. targeted, ultra-depth sequencing of UMI-tagged plasma DNA) samples to simulate cancer. The simulation was designed as follows: let's suppose that sample A and B had a similar number of reads and similar family size distributions to begin with, and the mixture is to be 5% Sample A and 95% Sample B. We ran DownsampleSam at 5% on Sample A and 95% on Sample B and ran Picard MergeSameFiles to combine the downsampled SAM files. We noticed, however, that the family distribution of the downsampled reads from Sample A looked vastly different from that of Sample B. In particular, the distribution of heavily-downsampled Sample A was like the leftmost panel in Figure 1, whereas barely-downsampled Sample B's distribution looked like the rightmost panel. This was highly undesirable for two reasons. First, we cannot create consensus reads from reads of family size one (i.e. none of its PCR duplicates were sequenced). Second, in a real tumor sample, we expect the tumor reads and non-tumor reads to have the same family size distribution; PCR or the sequencer does not care whether the DNA is tumor and non-tumor.
The new tool, DownsampleByDuplicateSet,³ was designed to address this very issue. It drops reads with the same UMI as one unit, rather than dropping read pairs. This way, reads and library complexity are reduced at a specified fraction, thereby preserving the original family distribution after downsampling. See Figure 2.
Figure 2. DownsampleByDuplicateSet preserves family size distributions. The family size distribution before downsampling (blue), after downsampling with DownsampleSam (green), after downsampling with DownsampleByDuplicateSet (red), which preserves the original family size distribution.
Note that while DownsampleSam was not the right tool for our particular experiment, it may very well be the appropriate tool for other use cases. If we wish to study the effect of less sequencing on duplication rate or family size distribution in liquid biopsy, for example, we would want to simulate sequencing less while fixing library complexity constant. In this scenario, DownsampleSam would be the appropriate tool.
We have reviewed the "sequencing less" method of downsampling, and why that might not be the appropriate method in certain cases. The new GATK tool DownsampleByDuplicateSet offers a different approach to downsampling that is well-suited if we wish to preserve the family size distribution of the original SAM file.
Do you have an experiment in mind where DownsampleByDuplicateSet may be useful? Are there other types of downsampling that GATK/Picard does not currently support? We read all your comments, so please feel free to leave one for us below.
¹ PCR duplicates are two read pairs (fragments) that originated from the same original DNA library. Two fragments are deemed PCR duplicates when their alignment start and end positions match (and have the same UMIs, if they are available). Optical duplicates, on the other hand, occur when a sequencer detects the fluorescent signal from a single DNA insert as multiple units from nearby clusters, thereby creating multiple copies of the same read. Thus when we find two fragments that match in bases and were sequenced on nearby positions on the flowcell, we marked them as optical duplicates. Distinguishing the two kinds of duplicates improves the library complexity estimate.
² A typical Illumina flowcell consists of lanes. Each lane in turn divides into two columns of 50 to 60 tiles arranged in rows. The sequencer takes a photograph of each tile.
³ Note that DownsampleByDuplicateSet takes as input a SAM file preprocessed with an FGBio GroupReadsByUMI, which is a nifty tool that sorts a SAM file by their UMIs and coordinate positions.