At a high level, deduplication systems (1) identify duplicate objects and (2) eliminate redundant copies of information.
How the system defines an “object” and how it defines a “redundant copy” is system specific.
Deduplication systems can be defined along several axes.
On-line vs. offline
In on-line dedup systems, the deduplication process happens at the time that data is written. A typical online deduplication system works as follows:
When the system receives a request to write some new object O (i.e., before O’s data is actually written), the system checks to see if a duplicate copy of O already exists.
If the object is found to be unique, then O is written as normal.
If the system determines that a duplicate object is already present in the system, the new object O is not written; instead, a reference to the original copy is stored.
In off-line dedup systems, the deduplication process happens after data is already persisted. A typical offline deduplication system will have a background process that scans through persistent data and replaces duplicate objects with references to a common copy. Offline deduplication ultimately saves space because redundant copies are replaced with a reference, but there is a (potentially long) time window between when a duplicate copy is written and when the copy is replaced.
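As a rough illustration, here is a minimal sketch of the online write path described above, using an in-memory dictionary as a stand-in for the fingerprint index (the class and method names, and the choice of SHA-256, are illustrative rather than any particular system’s design):

```python
import hashlib

class InlineDedupStore:
    """Toy inline (on-line) deduplication store: data is keyed by its fingerprint."""

    def __init__(self):
        self.index = {}      # fingerprint -> stored data (stand-in for a chunk/object store)
        self.objects = {}    # object name -> fingerprint (the reference to the shared copy)

    def write(self, name: str, data: bytes) -> bool:
        """Write object `name`; return True if the data was actually stored."""
        fp = hashlib.sha256(data).hexdigest()
        self.objects[name] = fp              # always record a reference
        if fp in self.index:                 # duplicate: skip the data write
            return False
        self.index[fp] = data                # unique: store the data
        return True

    def read(self, name: str) -> bytes:
        return self.index[self.objects[name]]

store = InlineDedupStore()
store.write("a.txt", b"hello world")   # True: unique, data written
store.write("b.txt", b"hello world")   # False: duplicate, only a reference stored
```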
Fingerprinting
The term fingerprint is used to describe a “unique” content identifier. An example of a “unique” content identifier is the output of a cryptographically strong hash function because:
No matter how many times I compute a hash function on a given object, the output is always the same.
If my hash function is cryptographically strong, the probability of a collision should be very very very low.
In many deduplication systems, fingerprints are used to simplify the task of identifying duplicate data. One way to determine whether an object is a duplicate is to scan through all other objects on the storage system and do a byte-by-byte comparison, but that would be impractical. Instead, a cryptographic hash can be used to compactly represent an object’s contents. Then, to determine whether two objects (regardless of their size) are equivalent, only their fixed-size fingerprints need to be compared: if the hashes are the same, then the objects have the same contents.
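For instance, with SHA-256 as the fingerprinting function (a sketch; any cryptographically strong hash would serve):

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    # The 32-byte SHA-256 digest stands in for the object's contents
    return hashlib.sha256(data).digest()

a = b"some object contents" * 1000
b = b"some object contents" * 1000
c = a + b"!"                              # differs by a single byte

assert fingerprint(a) == fingerprint(b)   # identical contents -> identical fingerprints
assert fingerprint(a) != fingerprint(c)   # different contents -> (almost surely) different fingerprints
```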
For this scheme to work, the probability of a hash collision must be very-close-to-zero. In other words, if two objects differ by even one byte, then their hashes should be different. That is because, in a deduplication system, a hash collision causes two different objects to be treated as the same object, which is a correctness error.
We often choose hash functions and hash sizes (i.e., the number of bits in the fingerprint) such that the probability of a hash collision is less likely than the probability of a hardware error.
Since we assume that the bits of a cryptographically strong hash are uniformly random, each additional bit that is added to a hash will double the number of unique values that the hash function can represent.
The “birthday paradox” is often used to analyze the collision probability and justify a particular hash size.
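A back-of-the-envelope version of that birthday-bound argument, using the standard approximation that the collision probability for n fingerprints of b bits is roughly n²/(2·2^b):

```python
# Approximate birthday-bound collision probability for n fingerprints of b bits:
#   P(collision) ~= n^2 / (2 * 2^b)
def collision_probability(n_chunks: int, hash_bits: int) -> float:
    return (n_chunks ** 2) / (2.0 * (2 ** hash_bits))

# Example: 2^40 (~1.1 trillion) chunks with a 256-bit fingerprint
p = collision_probability(2 ** 40, 256)
print(p)   # ~5e-54, far below plausible hardware error rates
```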
Chunking
Chunking is the process of subdividing a data stream into pieces, or chunks. Chunks can be whole-file objects, fixed-size objects (similar to how a file is divided into 4KiB blocks), or variable-sized objects.
Whole-file deduplication is simple and often has low overheads. In whole-file deduplication, if two or more files are exact copies, only one version of the file is stored. Subsequent files with matching contents are added to the system as references to the original.
A benefit is that the amount of metadata required to keep track of all whole-file objects in the system is quite low; it scales with the number of files.
A downside is that any modification to a file requires a unique copy to be made (breaking all sharing with similar files), which means that whole-file deduplication systems often have lower compression ratios than systems that use finer-granularity chunking schemes.
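A minimal sketch of whole-file duplicate detection over a directory tree (reading each file whole and the choice of SHA-256 are simplifications for illustration):

```python
import hashlib
import os

def find_duplicate_files(root: str) -> dict:
    """Map each whole-file fingerprint to the list of paths that have that content."""
    by_fingerprint = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                fp = hashlib.sha256(f.read()).hexdigest()
            by_fingerprint.setdefault(fp, []).append(path)
    # Any fingerprint with more than one path is a whole-file duplicate;
    # all but the first copy could be replaced with a reference.
    return {fp: paths for fp, paths in by_fingerprint.items() if len(paths) > 1}
```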
When dividing a data stream into fixed-size chunks, chunk sizes are often chosen so that the sizes are multiples of system hardware parameters, like memory pages or disk sectors.
One benefit is that the process of breaking an object into fixed-size chunks is relatively easy, since no computation over the data is needed to find chunk boundaries.
Another benefit is that, if a single chunk is modified, the remaining unmodified chunks can still be shared with other objects’ chunks.
One downside is that, if data shifts (for example, after the insertion of a byte at the head of a file), then the contents of all subsequent chunks will shift. Thus, local changes can falsely cause duplicates to be treated as unique.
Another downside is that the amount of metadata required to keep track of all chunks in the system scales with the amount of data (not the number of files).
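A minimal fixed-size chunker, along with a small demonstration of the boundary-shifting problem described above (the 4 KiB chunk size is just an example):

```python
def fixed_size_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into consecutive chunk_size-byte pieces (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

original = bytes(i % 251 for i in range(64 * 1024))   # non-repeating test data
shifted  = b"\x00" + original                          # insert one byte at the front
# After the insertion, every chunk boundary shifts, so no chunks are shared:
assert not set(fixed_size_chunks(original)) & set(fixed_size_chunks(shifted))
```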
Variable-size chunks are often defined by the contents of the objects using Rabin fingerprints or some other sliding window method.
The benefit of most variable-size chunking schemes is that, since the chunk boundaries are determined by the contents of a sliding window, local changes to one chunk are often isolated to that chunk. In other words, unlike fixed-size chunking, where adding or deleting data shifts the boundaries of all chunks that follow, variable-size chunks do not suffer from this so-called boundary-shifting problem.
If a single chunk is modified, common chunks can still be shared
If data shifts, then it is unlikely that nearby chunk boundaries are affected, since the boundaries are determined by the contents
One downside is that, since variable-size chunking requires computing a sliding-window calculation over the data, the chunking process itself may add some CPU overheads to the deduplication process. However, this CPU overhead is often offset by the I/O savings that accompany the higher compression ratios.
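A toy content-defined chunker using a Rabin-Karp-style polynomial rolling hash over a sliding window (production systems typically use Rabin fingerprints or a buzhash variant; the window size, mask, and min/max bounds here are illustrative):

```python
BASE = 257
MOD = 1 << 32   # keep the rolling hash in 32 bits

def content_defined_chunks(data: bytes, window: int = 48, mask: int = 0x1FFF,
                           min_size: int = 2048, max_size: int = 65536):
    """Cut a chunk when the low bits of the rolling hash match `mask`; expected chunk
    size is roughly (mask + 1) bytes beyond the minimum, so ~10 KiB with these values."""
    pow_out = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i - start >= window:
            h = (h - data[i - window] * pow_out) % MOD   # drop the outgoing byte
        h = (h * BASE + b) % MOD                          # add the incoming byte
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])              # boundary is content-defined
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                       # final partial chunk
    return chunks
```

Because the boundary decision depends only on the bytes inside the sliding window, inserting data near the front of a file changes the chunks around the edit but leaves later boundaries (and therefore later chunks) intact.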
Chunk Size
Whether fixed or variable, the size of the chunks affects a deduplication system’s performance.
Using large chunks creates lower metadata overheads (fewer chunks, so the number of fingerprints in the system is smaller), but large chunks usually result in lower deduplication ratios (coarser granularity of sharing).
Using small chunks creates higher metadata overheads (more chunks, so the number of fingerprints in the system is higher), but small chunks usually have higher deduplication ratios (objects can share data at finer granularities).
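A rough back-of-the-envelope for this trade-off, assuming one index entry per chunk of about 48 bytes (a 32-byte fingerprint plus some location/refcount bookkeeping; the numbers are illustrative):

```python
def index_size_bytes(data_bytes: int, chunk_size: int, entry_bytes: int = 48) -> int:
    """Approximate fingerprint-index size: one ~48-byte entry per chunk."""
    return (data_bytes // chunk_size) * entry_bytes

TiB = 1 << 40
print(index_size_bytes(1 * TiB, 4 * 1024) // (1 << 20), "MiB")    # 12288 MiB at 4 KiB chunks
print(index_size_bytes(1 * TiB, 64 * 1024) // (1 << 20), "MiB")   # 768 MiB at 64 KiB chunks
```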
Fingerprint Index
If indexing is the act of organizing a set of data, then the fingerprint index is one of the most important parts of a deduplication system. A typical fingerprint index might store:
all fingerprints in the system,
the location of the chunks that correspond to each fingerprint
the reference count of each chunk
Thus, the fingerprint/chunk index can be a performance bottleneck in large deduplication systems: on every write, it must be consulted to see if the chunk is a duplicate, and in many designs, it must be consulted on reads to find the data associated with a fingerprint. A significant challenge is that hashes are randomly distributed, so fingerprint index lookups often have no locality within the indexing data structure, even if the workload itself has high locality. And since the fingerprint index likely will not fit into RAM, these lookups would translate into many expensive I/Os if not managed properly.
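A sketch of what a single entry in such an index might hold (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    """Where a chunk lives on disk and how many objects reference it."""
    container_id: int   # on-disk container/segment that holds the chunk
    offset: int         # byte offset of the chunk within the container
    length: int         # chunk length in bytes
    refcount: int       # number of objects that reference this chunk

# The index itself maps fingerprint (bytes) -> IndexEntry; at scale it lives mostly on disk.
fingerprint_index = {}
```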
Bloom filters can be used to detect whether a fingerprint exists, eliminating the need for some unnecessary lookups.
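A minimal Bloom filter sketch (the filter size, number of hash positions, and double-hashing scheme are illustrative; a real system would size the filter from the expected number of fingerprints and a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k bit positions derived from one SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        digest = hashlib.sha256(fingerprint).digest()
        # Double hashing: derive k positions from two 8-byte halves of the digest
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, fingerprint: bytes):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # False means "definitely not present"; True means "probably present"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))
```

When might_contain returns False, the expensive on-disk index lookup can be skipped entirely; when it returns True, the index still has to be checked because of possible false positives.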
Some systems group fingerprints on disk. If groups are defined by temporal locality, then caching and evicting whole groups may improve cache efficiency.
Question: how would you define the appropriate “group” for a very common chunk (e.g., a chunk that is common to many unrelated files, each with its own fingerprint group)?