Deduplication

CS333

Spring 2021

Deduplication

Deduplication is a form of compression.

At a high level, deduplication systems:

  1. identify duplicate objects, and

  2. eliminate redundant copies of information.

How a system defines an “object” and what it considers a “redundant copy” are system specific.

Deduplication systems can be characterized along several axes.

On-line vs. offline

In on-line dedup systems, the deduplication process happens at the time that data is written. A typical online deduplication system works as follows:

  1. When the system receives a request to write some new object O (i.e., before O’s data is actually written), the system checks to see if a duplicate copy of O already exists.

  2. If a duplicate copy exists, the system records a reference to the existing copy rather than writing O’s data again.

  3. Otherwise, the system writes O’s data and records its fingerprint so that future duplicates of O can be detected.

In off-line dedup systems, the deduplication process happens after data is already persisted. A typical offline deduplication system will have a background process that scans through persistent data and replaces duplicate objects with references to a common copy. Offline deduplication ultimately saves space because redundant copies are replaced with a reference, but there is a (potentially long) time window between when a duplicate copy is written and when the copy is replaced.
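To make the on-line write path concrete, here is a minimal sketch in Python. The names (InlineDedupStore, write) are illustrative, and an in-memory dict and list stand in for the on-disk fingerprint index and chunk store; an offline system would instead run a similar duplicate-detection pass as a background process over already-persisted data.

```python
import hashlib

class InlineDedupStore:
    """Minimal sketch of an on-line (inline) dedup write path.
    An in-memory dict/list stand in for the on-disk index and chunk store."""

    def __init__(self):
        self.index = {}    # fingerprint -> chunk id
        self.chunks = []   # chunk id -> chunk data

    def write(self, data: bytes) -> int:
        fp = hashlib.sha256(data).digest()
        if fp in self.index:
            # Duplicate: record a reference to the existing copy, write nothing new.
            return self.index[fp]
        # New data: store it and remember its fingerprint for future writes.
        chunk_id = len(self.chunks)
        self.chunks.append(data)
        self.index[fp] = chunk_id
        return chunk_id

store = InlineDedupStore()
a = store.write(b"hello world" * 1000)
b = store.write(b"hello world" * 1000)   # duplicate: no new chunk is stored
assert a == b and len(store.chunks) == 1
```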

Fingerprinting

The term fingerprint is used to describe a “unique” content identifier. A fingerprint is typically the output of a cryptographically strong hash function because the probability that two different objects hash to the same value (a collision) is negligibly small, so the hash value can be treated as if it uniquely identifies an object’s contents.

In many deduplication systems, fingerprints are used to simplify the task of identifying duplicate data. One way to determine whether an object is a duplicate would be to scan through all other objects on the storage system and do a byte-by-byte comparison, but that is impractical. Instead, a cryptographic hash can be used to compactly represent an object’s contents. Then, to determine whether two objects (regardless of their size) are equivalent, only their fixed-size fingerprints need to be compared: if the hashes are the same, the objects are assumed to have the same contents.
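As a small illustration (assuming SHA-256 as the hash; any cryptographically strong hash would do), comparing fingerprints replaces a byte-by-byte comparison of arbitrarily large objects with a comparison of two fixed-size digests:

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()   # 32-byte digest, regardless of object size

big_object_1 = b"the quick brown fox" * 100_000
big_object_2 = b"the quick brown fox" * 100_000
other_object = b"something else entirely"

# Equality of the fixed-size fingerprints stands in for a byte-by-byte comparison.
assert fingerprint(big_object_1) == fingerprint(big_object_2)
assert fingerprint(big_object_1) != fingerprint(other_object)
```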

Chunking

Chunking is the process of subdividing a data stream into pieces, or chunks. Chunks can be whole-file objects, fixed-size objects (similar to how a file is divided into 4KiB blocks), or variable-sized objects.
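A fixed-size chunker is the simplest case; the sketch below (using a hypothetical 4 KiB chunk size) just slices the stream at fixed offsets:

```python
def fixed_size_chunks(data: bytes, chunk_size: int = 4096) -> list[bytes]:
    """Split a byte stream into fixed-size chunks; the final chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```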

Chunk Size

Whether fixed or variable, chunk size affects a deduplication system’s performance: smaller chunks expose more duplicates but produce more fingerprints to store and look up, while larger chunks reduce that overhead but can miss duplicates that make up only part of a chunk.

As a reference point, the original Data Domain deduplication system paper used variable-size chunking with a target chunk size of 8KiB.
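Variable-size (content-defined) chunking places chunk boundaries based on the data itself, so an insertion near the start of a stream does not shift every later boundary. The sketch below uses a toy rolling hash and a boundary mask chosen for an average chunk size of roughly 8 KiB; real systems typically use Rabin fingerprints over a sliding window, so treat this as an illustration of the idea rather than a production chunker.

```python
def variable_size_chunks(data: bytes,
                         avg_size: int = 8192,
                         min_size: int = 2048,
                         max_size: int = 65536) -> list[bytes]:
    """Toy content-defined chunking.

    A chunk boundary is declared whenever the low bits of a cheap rolling hash
    are all zero. With mask = avg_size - 1 (avg_size a power of two), a random
    position matches with probability ~1/avg_size, so chunks average ~8 KiB.
    min_size/max_size bound the chunk lengths.
    """
    mask = avg_size - 1
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # cheap stand-in for a windowed Rabin hash
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```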

Indexing

If indexing is the act of organizing a set of data, then the fingerprint index is one of the most important parts of a deduplication system. A typical fingerprint index stores a mapping from each chunk’s fingerprint to the location of that chunk’s data on disk, possibly along with per-chunk metadata such as a reference count.

Thus, the fingerprint/chunk index can be a performance bottleneck in large deduplication systems: on every write, it must be consulted to see whether the chunk is a duplicate, and in many designs it must also be consulted on reads to find the data associated with a fingerprint. A significant challenge is that hashes are uniformly distributed, so fingerprint-index lookups have essentially no locality within the indexing data structure, even if the workload itself has high locality. And since the fingerprint index likely will not fit in RAM, each lookup can turn into an expensive I/O if the index is not managed carefully.
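One common way to avoid hitting the on-disk index on every lookup is to keep a compact in-memory summary, such as a Bloom filter, that can definitively say a fingerprint has never been seen. The sketch below is an illustrative (not system-specific) fingerprint index fronted by a Bloom filter; the dict stands in for the on-disk portion of the index, and the class and method names are assumptions for this example.

```python
import hashlib

class FingerprintIndex:
    """Illustrative fingerprint index fronted by an in-memory Bloom filter.
    The dict stands in for the on-disk index; a real system would page it from disk."""

    def __init__(self, num_bits: int = 1 << 20):
        self.num_bits = num_bits
        self.bloom = bytearray(num_bits // 8)   # compact in-RAM summary
        self.on_disk = {}                       # fingerprint -> chunk location

    def _bit_positions(self, fp: bytes):
        # Derive a few independent bit positions from the fingerprint itself.
        for k in range(4):
            digest = hashlib.blake2b(fp, digest_size=8, salt=bytes([k])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def insert(self, fp: bytes, location: int) -> None:
        for pos in self._bit_positions(fp):
            self.bloom[pos // 8] |= 1 << (pos % 8)
        self.on_disk[fp] = location             # the expensive, on-disk update

    def lookup(self, fp: bytes):
        # If any Bloom bit is unset, this fingerprint was never inserted:
        # the chunk is new, and the disk lookup can be skipped entirely.
        if not all(self.bloom[pos // 8] & (1 << (pos % 8))
                   for pos in self._bit_positions(fp)):
            return None
        # Otherwise (possibly a false positive), consult the on-disk index.
        return self.on_disk.get(fp)
```

A filter like this only helps writes of genuinely new data; lookups of chunks that do exist (or false positives) still pay for the on-disk access, which is why real systems pair it with caching strategies that exploit the workload’s locality.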