Filters

Filters are a class of data structures that support efficient membership queries. In other words, filters store approximate representations of sets, and filters let users quickly check whether an item is present in a set or not.

If a filter says that an item is not present in the set, the item is definitely not present in the set. In other words, “No” answers are always firm. On the other hand, filters generally allow for false positives. A false positive occurs when the filter says that an item is in the set, but the item is actually not present in the set. In many contexts, this is still useful behavior.

Filters are often used to save unnecessary work (typically expensive I/Os). When a false positive shows up, the result is just extra work that would have been done anyway in the absence of the filter.

This unit covers Bloom filters (the most popular and well-studied filter) and Quotient Filters (a filter that often yields better performance due to its cache-friendly access patterns).

Bloom Filters

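Bloom filters are tuned based on three parameters: N, the number of elements stored in the filter; m, the number of bits in the filter's bit array; and k, the number of independent hash functions.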

False negatives are never tolerated, but for a given N, m and k can be chosen to minimize the false positive rate.
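As a rough illustration of how m and k might be chosen, the sketch below uses the standard textbook approximation for the false positive rate, (1 - e^(-kN/m))^k, which assumes the hash functions are independent and uniform. The function names `bloom_parameters` and `expected_fpr` are hypothetical and exist only for this example.

```python
import math


def bloom_parameters(n: int, target_fpr: float) -> tuple[int, int]:
    """Suggest (m, k) for n elements and a target false positive rate.

    Uses the usual approximations m = -n ln(p) / (ln 2)^2 and
    k = (m / n) ln 2, rounded to whole bits and whole hash functions.
    """
    m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


def expected_fpr(n: int, m: int, k: int) -> float:
    """Approximate false positive rate after inserting n elements."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

Under these assumptions, `bloom_parameters(1000, 0.01)` suggests about 9.6 bits per element and 7 hash functions. Using noticeably fewer or more hash functions with the same m raises the approximate false positive rate.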

Operations

To insert into a Bloom filter, you hash the element using each of the k independent hash functions. You then set the corresponding bits in the m-bit array.

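To query a Bloom filter, you hash the element using each of the k independent hash functions. Then you check whether the corresponding bits in the m-bit array are set. The Bloom filter returns "no" if any of those bits is unset, and "yes" (more precisely, "probably yes") if all of them are set.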

Note that the negative answer is conclusive, but the positive answer does not provide a guarantee. A Bloom filter query can return a false positive if other, previously inserted elements have together set all k bits that the queried element hashes to.
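The insert and query operations described above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the method name `may_contain`, and the trick of simulating k hash functions by salting SHA-256 with an index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits in the array
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, all initially 0

    def _positions(self, item: str):
        # Assumption: salting one hash with i stands in for k independent
        # hash functions. Each salted hash yields one bit position.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Set the k bits that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, item: str) -> bool:
        # False is conclusive; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A query on an empty filter always returns False, and a query for an inserted item always returns True. A query for an item that was never inserted usually returns False, but may return True once enough bits have been set by other items.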

Standard Bloom filters do not support delete operations. This is because the filter does not track which elements are responsible for setting any given bit. If several elements share bit i, then resetting bit i effectively removes every element with at least one hash function that maps to bit i, which introduces false negatives. There are Bloom filter variants, such as “counting Bloom filters”, that support deletes. However, counting Bloom filters require additional metadata.
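To make the "additional metadata" concrete, here is a hedged sketch of a counting Bloom filter in which each bit is replaced by a small counter. The class and method names are hypothetical, the salted SHA-256 hashing is an assumption, and a plain Python list of integers stands in for the fixed-width counters a real implementation would use.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts = [0] * m   # one counter per slot instead of one bit

    def _positions(self, item: str):
        # Assumption: k salted hashes stand in for k independent functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def delete(self, item: str) -> None:
        # Precondition: the item was actually inserted. This guard only
        # catches definite non-members; deleting a false positive would
        # corrupt counters that belong to other items.
        if not self.may_contain(item):
            raise KeyError(item)
        for pos in self._positions(item):
            self.counts[pos] -= 1

    def may_contain(self, item: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(item))
```

Because a delete only undoes the increments made by the matching insert, other items that share a slot keep a nonzero counter and remain visible. The cost is space: each slot now needs several bits rather than one.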

Quotient Filters (QFs)

Quotient filters (QFs) are essentially hash tables that combine linear probing with a few extra metadata bits per bucket to manage collisions. However, as described in the paper, quotient filters do not store key-value pairs; instead, they just track set membership.

When storing an item K, the QF calculates h = hash(K) and stores a subset of the bits of h in the array. When looking up an item K, the QF calculates h = hash(K) and looks for that subset of h in the array. Like Bloom filters, QF parameters can be adjusted to trade off memory consumption against accuracy.

Quotient filters rely on a technique called (wait for it…) quotienting to reduce the size of the hash value that the filter stores.

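A quotient filter breaks a fingerprint into two parts: the quotient, which is the high-order q bits of the fingerprint, and the remainder, which is the low-order r bits.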

The quotient determines the fingerprint’s index in the array, and it is never explicitly stored. The bucket indexed by an element’s quotient is called its canonical slot, and the remainder is stored in a fingerprint’s canonical slot (in the absence of collisions).
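Quotienting itself is just bit manipulation. The sketch below assumes a fingerprint of q + r bits whose high-order q bits form the quotient and whose low-order r bits form the remainder, so the filter has 2^q buckets. The function names are hypothetical.

```python
def split_fingerprint(fingerprint: int, q: int, r: int) -> tuple[int, int]:
    """Split a (q + r)-bit fingerprint into (quotient, remainder)."""
    fingerprint &= (1 << (q + r)) - 1        # keep only the low q + r bits
    quotient = fingerprint >> r              # high-order q bits: bucket index
    remainder = fingerprint & ((1 << r) - 1) # low-order r bits: stored value
    return quotient, remainder


def rebuild_fingerprint(quotient: int, remainder: int, r: int) -> int:
    """Recover the fingerprint from a bucket index and a stored remainder."""
    return (quotient << r) | remainder
```

For example, with q = 3 and r = 5, the fingerprint 0b10110111 splits into quotient 0b101 (bucket 5) and remainder 0b10111. Only the 5-bit remainder needs to be stored, because the bucket index already encodes the other 3 bits.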

In the case of a collision, remainders are stored within a run. A run is a sorted list of remainders that share the same canonical slot. A maximal sequence of adjacent runs is called a cluster. Collision resolution can cause an entire cluster to shift, including remainders that do not share canonical slots. In fact, only the first run in a cluster is guaranteed to start at its canonical slot.

Three additional bits are kept per bucket in order to help resolve collisions:

A quotient filter example can be seen in the image below, followed by an explanation that gives more details about the usage of these bits.

Quotient Filter Details

Compared to Bloom filters, quotient filters are very cache friendly. Each lookup/insert jumps to a single random offset determined by the hash value, as opposed to the k random offsets chosen by the k independent hash functions in a Bloom filter. If the load factor is managed properly, runs and clusters are quite small, so very few cache lines must be accessed in order to perform any QF operation.

Quotient filters form the foundation of a write-optimized data structure called a cascade filter, due to their ability to efficiently resize and merge.

Questions

  1. What types of applications can benefit from using Bloom filters?
  2. What types of applications could never use a Bloom filter?
  3. How good/bad is the cache behavior of a standard Bloom filter (i.e., do lookups/inserts into the bit array have any locality)?
  4. For the following operations, decide whether a standard Bloom filter could support the operation, and why/why not (or under what conditions it is possible):
  5. Can you think of any way to modify the standard Bloom filter to support deletes?
  6. Which data structure do you think is the easiest to implement:

What types of things seem most challenging to get right? Do those things affect performance or correctness?