Research Overview

Because the majority of my work involves designing, building, and evaluating large computer systems, collaboration is important—building a new system, especially one with an original design that does not extend an existing system, requires a significant upfront investment. I have worked, and continue to work, as part of large teams, and my primary “research group” is one that has made a significant investment in the BetrFS file system. In what follows, I give a quick overview of storage research and my interests in the field. I then describe my large body of work with my BetrFS collaborators, followed by some active projects in other areas and with other teams, many of which are ideal projects for Williams undergraduate students.


If you are a Williams undergraduate who would like to work with me, please reach out to set up a meeting. I would love to talk about ideas that you have, or to expand upon any of the projects discussed below.


My Approach to Storage Research

My primary research focus is storage. In the broadest terms, the goals of storage research are to help users to maximize:


  • performance: all users want to read, write, and access their data faster

  • capacity: all users want to store more data without consuming more resources

  • safety: data must be kept secure (i.e., free from loss, corruption, and unauthorized access)


Most of my currently published work focuses on performance—performance models, performance measurement, and system designs that improve performance. We have also contributed new operations that would be prohibitively expensive in other designs; these operations are practical only because of our system's particularly strong and unique performance profile.


The software that I typically work on is called a file system. File systems act as an intermediary between an application (e.g., a web browser), and physical media (e.g., a hard disk drive). File systems give applications a standard language to use when communicating with devices, which is necessary because most devices have different physical properties; it would be impractical if the developers of an application like the Firefox web browser were required to distribute a different version of the application for each unique storage device that has ever been manufactured. Beyond providing a portable, usable, and safe interface to applications, file system designers continue to optimize for the three goals listed above: performance, capacity, and safety.
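
To make this idea concrete, here is a minimal sketch in Python (the file path and contents are purely illustrative): the application speaks only the portable interface, and the file system translates these generic calls into device-specific commands.

```python
def save_and_load(path: str, data: bytes) -> bytes:
    # "open", "write", and "read" are part of the portable interface; the file
    # system translates them into device-specific commands behind the scenes.
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        return f.read()

# The same code runs unmodified whether the backing device is a hard drive,
# an SSD, or a network share. (The path "example.txt" is purely illustrative.)
assert save_and_load("example.txt", b"hello, storage") == b"hello, storage"
```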


In addition to designing file systems, I also explore ways to efficiently manage devices themselves. Different device types often have different properties, such as endurance (some devices, like SSDs, wear out over time), interface (some devices use different command sets when communicating with software), degree of parallelism (some devices can satisfy multiple requests simultaneously), and cost (devices differ in dollars per gigabyte of storage), and system designs should account for these properties. For example, different trade-offs become desirable when the relative costs of operations change, and different designs are required when interfaces are incompatible.


Theory-driven system design. I believe that system designs should have sound theoretical foundations. I collaborate with several groups of researchers with diverse backgrounds, and all of us share this vision. My teams consist of theorists, operating systems designers, and security experts, and each of us contributes a unique and essential skill set. By co-designing systems and algorithms, we develop solutions that would not be possible if either were designed in isolation.


In the BetrFS team specifically, our unique approach of bringing theoretical innovations to storage system design has resulted in a successful run of papers. I am not aware of many other research groups with a similarly broad set of skills. Because we attack storage problems on multiple fronts simultaneously, in both system design and data structure design, the scope of problems we can address and the range of solutions available to us are extremely wide.


Other non-BetrFS research projects. Working with the BetrFS team is an important part of my research program; our collaboration model and our abundant shared resources help us maintain momentum. However, I also have active projects and collaborations on other storage topics, in particular zoned storage devices, deduplication, and deferred range queries. Below, I describe some of my active projects, starting with BetrFS.


The BetrFS Project Trajectory

Our team’s main research vehicle, the Bε-tree file system (BetrFS), achieves orders-of-magnitude performance gains on important operations without sacrificing performance on others. For example, BetrFS can perform microwrites—small data or metadata updates—orders of magnitude faster than the next best general-purpose file system, and BetrFS still matches other file systems on sequential read performance.


Simultaneously achieving good small-write performance and good sequential-read performance has been such a long-standing challenge in the storage community that file system designs have diverged into two distinct categories: log-structured file systems, which have good small-update performance but poor sequential-read performance, and update-in-place file systems, which have good sequential-read performance but poor small-update performance. By applying modern data structure techniques, specifically write optimization, we have shown that there is no fundamental trade-off between updates and sequential reads; instead, previous designs have simply relied on brittle heuristics that fail under common use cases.


Our ultimate goal with the BetrFS project is to build a truly general-purpose file system. In other words, we believe we can build a file system that performs as well as the best competing file system on every operation and on every device. Our strategies are to (1) perform data structure and system co-design (i.e., innovate in the theory and the system design simultaneously) and (2) take an incremental approach (i.e., start with a prototype that demonstrates promise, then incrementally tackle our biggest bottlenecks until we fully realize our vision).


The first BetrFS paper's goal was to use an off-the-shelf Bε-tree implementation as a “black box”, and show that we could overcome the perceived trade-off between small writes and sequential reads. However, we knew that there would be limitations to the black-box approach that we would only be able to overcome by modifying the data structure internals and the surrounding system infrastructure. Our first implementation successfully demonstrated the power of write-optimization, but we predictably hit the limitations of a black-box approach.


In our second BetrFS paper, we focused on fixing the performance of our two slowest operations, but we limited our modifications to the schema (the way that we name our data within our data structure) and the surrounding infrastructure (e.g., the write-ahead logging implementation). In this paper, we sacrificed some of the “purity” of the original design by introducing a configurable level of indirection to our schema.


In our third paper on the core BetrFS design, we dug into the data structure internals and made invasive changes; specifically, we contributed data structure techniques for performing efficient namespace operations in a Bε-tree, and we used those techniques to implement fast renames in a full-path indexed file system. BetrFS is a full-path indexed file system because each file object is named by its full file system pathname. For example, the file /home/bill/data.txt consists of a series of key-value pairs, where keys take the form /home/bill/data.txt_BLOCK_NUMBER and the values are the data associated with a given 4KiB block.


Rename operations are particularly hard in a full-path indexed file system because renaming a file object requires updating a prefix in all of its keys. For example, renaming the file /home/bill/data.txt to /public/data.txt requires taking every single block that makes up the file and replacing the prefix “/home/bill/” with the prefix “/public/”. There are no corresponding data structure operations for bulk renames, so a single rename of a BetrFS file requires many individual operations to delete all the old keys (one for each file block) and insert new keys in their place. This is problematic because renaming a file becomes more expensive as the file gets larger; ideally, renaming a file would be fast regardless of its size.
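
To make the schema and the rename cost concrete, here is a small sketch. It uses a plain Python dictionary in place of the real Bε-tree, and the helper names are hypothetical, but the key format and block size follow the description above.

```python
BLOCK_SIZE = 4096  # 4 KiB blocks, as described above

def block_key(path: str, block_number: int) -> str:
    # Keys are the full pathname plus a block number, e.g. "/home/bill/data.txt_0".
    return f"{path}_{block_number}"

def write_file(store: dict, path: str, data: bytes) -> None:
    for i in range(0, len(data), BLOCK_SIZE):
        store[block_key(path, i // BLOCK_SIZE)] = data[i:i + BLOCK_SIZE]

def naive_rename(store: dict, old_path: str, new_path: str) -> None:
    # Every key carrying the old path must be deleted and reinserted under the
    # new path, so the cost grows with the number of blocks in the file.
    old_prefix = old_path + "_"
    for key in [k for k in store if k.startswith(old_prefix)]:
        block_number = int(key[len(old_prefix):])
        store[block_key(new_path, block_number)] = store.pop(key)

store = {}
write_file(store, "/home/bill/data.txt", b"x" * (3 * BLOCK_SIZE))
naive_rename(store, "/home/bill/data.txt", "/public/data.txt")
assert sorted(store) == [f"/public/data.txt_{i}" for i in range(3)]
```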


To implement fast renames, we converted the underlying Bε-trees in BetrFS into a form of Bε-trie using a technique we call lifting. In a lifted Bε-tree, interior node pivots store portions of key prefixes, and full keys are assembled by concatenating all the prefixes along a root-to-leaf path. We then implemented tree surgery by (1) identifying the least common ancestor (LCA) of the source and destination, (2) “slicing” the tree at the source and destination, (3) moving the source subtree into the destination subtree, and (4) “healing” the tree by merging nodes. Tree surgery requires rewriting only a logarithmic number of nodes (the base, B, is very large, so the height of the tree is quite small), rather than a number of nodes that scales with the number of keys being renamed. Put differently, we made serious modifications to the way we defined our data structure, and we showed that, after those modifications, the worst-case behavior is actually quite good.
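
The sketch below illustrates only the lifting idea, not the full tree-surgery implementation: each node stores a fragment of the key, full keys are assembled by concatenating fragments along a root-to-leaf path, and changing one stored fragment therefore renames every key beneath it. The class and field names are hypothetical simplifications.

```python
# A sketch of "lifting": full keys are assembled by concatenating prefix
# fragments along a root-to-leaf path. Moving a subtree under a new parent
# (and updating a single stored fragment) logically renames every key below
# it without touching the descendant nodes. Names are hypothetical.

class Node:
    def __init__(self, fragment: str, children=None):
        self.fragment = fragment          # lifted portion of the key prefix
        self.children = children or []

def full_keys(node: Node, prefix: str = "") -> list[str]:
    prefix += node.fragment
    if not node.children:
        return [prefix]
    return [k for child in node.children for k in full_keys(child, prefix)]

leaf_a = Node("data.txt_0")
leaf_b = Node("data.txt_1")
bill = Node("/home/bill/", [leaf_a, leaf_b])
root = Node("", [bill])

assert full_keys(root) == ["/home/bill/data.txt_0", "/home/bill/data.txt_1"]

# "Renaming": change one lifted fragment; all descendant keys change
# implicitly because they are assembled from the prefixes along the path.
bill.fragment = "/public/"
assert full_keys(root) == ["/public/data.txt_0", "/public/data.txt_1"]
```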


The benefit of tree surgery is that moving nodes to a new location within the tree can be performed by updating pointers rather than by paying the prohibitive cost of rewriting physical nodes. However, in many cases keys that are stored inside nodes still need to be changed because the affected prefix is not fully “lifted” out of the subtree. To handle this, we created a new operation that alters the behavior of future operations on the tree. The “message” (i.e., a persistent encoding of an operation as data within the tree itself) essentially instructs future operations to substitute all instances of a key prefix with a new key prefix. This message does not alter the asymptotic performance.
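
The sketch below conveys only the flavor of this idea, as a hypothetical simplification: a pending prefix substitution is recorded as data, and future operations consult it rather than eagerly rewriting stored keys (here the substitution is applied at lookup time for brevity; in the real tree, messages are applied as they are flushed downward).

```python
# A sketch of an operation encoded as a "message": rather than eagerly
# rewriting every affected key, we record a pending prefix substitution and
# let future operations apply it when they translate incoming keys.
# The representation below is a hypothetical simplification.

store = {"/home/bill/data.txt_0": b"block 0", "/home/bill/data.txt_1": b"block 1"}
pending_messages = []  # list of (new_prefix, old_prefix) substitutions

def rename_prefix(new_prefix: str, old_prefix: str) -> None:
    # Constant-time: just append a message; no stored keys are rewritten.
    pending_messages.append((new_prefix, old_prefix))

def lookup(key: str) -> bytes:
    # Translate the requested key back through pending substitutions before
    # consulting the store, so reads see the post-rename namespace.
    for new_prefix, old_prefix in reversed(pending_messages):
        if key.startswith(new_prefix):
            key = old_prefix + key[len(new_prefix):]
    return store[key]

rename_prefix("/public/", "/home/bill/")
assert lookup("/public/data.txt_1") == b"block 1"
```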


Now that we had demonstrated BetrFS’s strong performance on all standard operations, we wanted to show that, by incorporating modern data structures, we could tackle long-standing challenges in file system design. In particular, our fourth BetrFS paper describes our implementation of directory cloning. Cloning is the process of creating a logical copy of an arbitrary directory subtree (e.g., I could clone my home directory to make a copy of all of my files). BetrFS clones are exciting because they are both writable and performance-neutral. In other words, both the original folder and the logical copy can be modified, creating the clone does not affect the performance of operations on the original, and operations on the clone are as fast as operations on the original. The applications of space-efficient, performance-neutral clones are numerous. For example, imagine taking a Time Machine-style backup of your files every minute, with no noticeable overhead.


Our fifth BetrFS paper brought us closer to our vision of building a truly general-purpose file system. Up until that point, we had shown that BetrFS has top-of-the-line performance on all operations when run on a hard drive. However, we observed that, when the only change we made was to swap out the hard drive for a faster SSD, BetrFS performance dropped well below the competition for many important operations. To fix this, we contributed a series of optimizations that amount to a grab bag of systems design insights, streamlined kernel interfaces, and data structure integrations. Although many of the optimizations were specific to BetrFS, the overarching ideas are applicable to general systems design. We also outlined criteria for what it means to be general-purpose, or compleat, and we showed that BetrFS is the only file system to achieve compleatness on commodity SSDs.




BetrFS summary. BetrFS is a file system that uses modern data structure techniques to improve performance. Since we ground our designs in theory, BetrFS has no algorithmic “worst cases”. Since we design data structures and systems in tandem, we can create new optimization opportunities that would not otherwise be possible. BetrFS is a longstanding project that has won several awards, and we are continuing to push BetrFS designs into new places.




New Models

Building system software is incredibly labor intensive. We invested many person-years into building BetrFS, but we only did so after using theoretical models to show that real performance gains would justify the effort. Thus, accurate and predictive models are incredibly valuable for systems researchers.


The Disk Access Model (DAM) has long been the gold standard for evaluating external memory algorithms (an external memory algorithm is one where the data being processed is so large that it does not all fit into memory at once). However, the DAM is limited. We've shown that the DAM often cannot be used to choose important parameters when tuning algorithm performance because the DAM abstracts away the low-level details of the hardware.


We introduced two new models, both variations of the DAM, that more accurately reflect physical reality. The affine model is a DAM refinement for hard drives. In the affine model, each operation is charged a “setup” cost, and then charged a “bandwidth” cost that is proportional to the size of the operation. This corresponds to the physical actions taken by a disk: the disk performs a seek (setup) followed by one or more rotations to read the actual data (bandwidth cost).
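
In symbols (my notation here, not necessarily the notation used in the paper), the affine cost of an I/O that transfers x bytes is:

```latex
% Illustrative notation: an I/O that transfers x bytes pays a fixed setup
% cost s (the seek) plus a bandwidth cost proportional to its size.
\[
  \mathrm{cost}_{\text{affine}}(x) \;=\; s + b \cdot x
\]
% Large sequential transfers amortize s, while small scattered I/Os are
% dominated by it; the DAM, by contrast, charges a unit cost per block.
```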


The PDAM model is a DAM refinement for solid state drives (SSDs). Most solid state drives are capable of performing several operations in parallel; in fact, to achieve their advertised performance, one must perform multiple operations in parallel. The PDAM models SSDs by adding a parallelism parameter to the DAM.
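
Again in illustrative notation, if a drive can service up to P block requests per I/O step, then N independent block accesses cost:

```latex
% Illustrative notation: with internal parallelism P, up to P block requests
% can be serviced in a single I/O step, so N independent block accesses cost
\[
  \mathrm{cost}_{\text{PDAM}}(N) \;=\; \left\lceil \tfrac{N}{P} \right\rceil
  \quad \text{I/O steps, versus } N \text{ steps in the DAM.}
\]
```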


After motivating the affine and PDAM models and validating their accuracy on real hardware, we show that we can use these models to inform real data structure parameter choices.




Models summary. We presented two variations on the DAM and showed that the variants are more powerful and predictive. We validated those models empirically and used the models to tune real data structures.




Aging

Aging is the process by which a file system grows slower over time. The aging slowdown is the result of fragmentation—when logically related objects become physically scattered. Fragmentation is problematic because it is more efficient to access objects that are physically close together. In our first aging paper, we analyzed the aging characteristics of real file systems and contributed our set of tools and aging workflows. We showed that all non-BetrFS file systems age—they age dramatically and they age quickly.


Conventional wisdom suggests that “disk fullness” is the primary cause of file system aging, which makes sense. Consider a movie theater. As the seats fill, the number of large contiguous runs of open seats decreases. If a large party enters a nearly full movie theater, it may be impossible for them all to sit together; the large party may have to break up into several smaller groups to be able to find seats. Intuitively, finding contiguous free space becomes harder as the theater becomes more full.


File systems try to combat this problem by placing related objects together, hopefully preserving large ranges of free space. However, these file system data placement policies are merely heuristics. We hypothesized that disk fullness exacerbates aging, but that aging can cause significant slowdowns even when disks are nearly empty. Our workshop paper on aging presents experiments that support this hypothesis. We conclude that “usage” is the primary cause.




Aging summary. We show that aging is exacerbated when disks are full, but that aging can cause significant slowdowns well before space is an issue. Free space can become quickly fragmented by real-world file system usage.




Zoned Storage Devices

Shingled magnetic recording (SMR) hard disk drives resemble standard hard drives in many ways: they contain similar physical components and have similar performance profiles. However, SMR hard drives boast increased capacity due to a technique called “shingling”. A “shingled write” is a write that partially overlaps previously written data, similar to how each row of shingles on a roof partially overlaps the row below it. Shingling increases the drive’s density, but it also restricts the types of updates that the drive can safely perform. For example, if a roofer secured one row of shingles on top of another, a “bottom” shingle could not be safely replaced without first removing the shingles above it.


The nature of shingling requires SMR hard drives to use a non-standard device interface, called a zoned block interface, that is incompatible with existing storage software. Thus, to remain usable, SMR hard drives implement a translation layer that converts the requests made by existing storage software into requests that are safe for the drive.


Together with colleagues at NetApp Inc., I was awarded a patent for developing a tree-based translation layer that converts standard hard drive requests into requests that are compatible with a zoned storage device, like an SMR drive.


Our SMR translation prototype drastically improves the I/O performance profile of many block workloads on host-managed SMR drives; non-sequential update workloads are transformed into more efficient sequential writes using write-optimized techniques. Our prototype code is available under an open-source license.
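
For readers unfamiliar with translation layers, the sketch below shows the general shape of the idea rather than our patented tree-based design: writes, sequential or not, are appended at the current zone's write pointer, and a mapping table records where each logical block currently lives. The zone size and class names are hypothetical.

```python
# A sketch of the general idea behind a translation layer for zoned devices
# (not the patented tree-based design): incoming block writes, which may be
# non-sequential, are appended sequentially to the current zone, and a mapping
# table records where each logical block address (LBA) now lives.

ZONE_CAPACITY = 4  # blocks per zone (tiny, for illustration)

class ZonedTranslationLayer:
    def __init__(self, num_zones: int):
        self.zones = [[] for _ in range(num_zones)]  # append-only zones
        self.mapping = {}      # LBA -> (zone index, offset within zone)
        self.current_zone = 0

    def write(self, lba: int, data: bytes) -> None:
        # Zoned devices only allow sequential writes within a zone, so every
        # update, sequential or not, is appended at the current write pointer.
        if len(self.zones[self.current_zone]) == ZONE_CAPACITY:
            self.current_zone += 1  # real designs also garbage collect zones
        zone = self.zones[self.current_zone]
        self.mapping[lba] = (self.current_zone, len(zone))
        zone.append(data)

    def read(self, lba: int) -> bytes:
        zone_index, offset = self.mapping[lba]
        return self.zones[zone_index][offset]

ftl = ZonedTranslationLayer(num_zones=8)
for lba in (7, 3, 7, 1):          # a non-sequential, update-heavy pattern
    ftl.write(lba, f"block {lba}".encode())
assert ftl.read(7) == b"block 7"  # the most recent version of LBA 7
```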


I am very interested in extending this project to other zoned storage devices, in particular ZNS SSDs. If you are interested in collaborating, or if you have access to real ZNS SSD hardware, please contact me.




Bringing Write Optimization to Reads

One of the reasons that the BetrFS project is so exciting to me is that our team has made advances on the theoretical side of computer science—specifically in the area of write optimization—and incorporated those advances into our solutions to systems problems. Write optimization is a strategy that, as the name suggests, speeds up write operations. Write-optimized data structures work by deferring the cost of completing related tasks until enough of them have accumulated to make paying that cost worthwhile. At that point, the pent-up tasks are completed in a single, efficient batch.
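
As a toy illustration of the batching idea (a hypothetical sketch, not the Bε-tree itself): updates accumulate cheaply in a buffer, and the expensive work of applying them to the large underlying structure is paid once per batch rather than once per update.

```python
# A toy illustration of write optimization (not the Be-tree itself): updates
# accumulate cheaply in a buffer, and the expensive work of applying them to
# the large on-disk structure is paid once per batch rather than once per update.

class BatchedStore:
    def __init__(self, batch_size: int = 1024):
        self.buffer = []          # cheap, in-memory pending updates
        self.on_disk = {}         # stand-in for the expensive-to-update structure
        self.batch_size = batch_size
        self.flushes = 0

    def put(self, key, value) -> None:
        self.buffer.append((key, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # One expensive operation applies many pent-up updates at once.
        self.flushes += 1
        for key, value in self.buffer:
            self.on_disk[key] = value
        self.buffer.clear()

store = BatchedStore(batch_size=1024)
for i in range(10_000):
    store.put(i, i)
print(store.flushes)  # 9 batched flushes instead of 10,000 individual updates
```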


There are certain types of read operations to which this strategy could also be applied. Any read that does not need to be completed immediately, such as the type of large summary computation that the MapReduce programming model was designed to solve, could benefit from batching as long as the operation can be encoded as an object in a write-optimized key-value store. Deferred range queries, or “derange queries”, do just that.
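
The following is a hypothetical sketch of the encoding idea only; in the real design, the pending queries would live inside the write-optimized key-value store itself. A range query that does not need an immediate answer is recorded as a pending object, and all pending queries are answered together in one batched pass over the data.

```python
# A hypothetical sketch of the derange-query idea: a range query that need not
# be answered immediately is recorded as a pending object, and all pending
# queries are answered together in one batched pass over the data.

pending_queries = []  # (lo, hi, callback) tuples awaiting a batched answer

def derange_query(lo, hi, callback) -> None:
    pending_queries.append((lo, hi, callback))   # cheap: just record it

def answer_pending(store: dict) -> None:
    # One pass over the data answers every deferred query in the batch.
    items = sorted(store.items())
    for lo, hi, callback in pending_queries:
        callback([(k, v) for k, v in items if lo <= k <= hi])
    pending_queries.clear()

store = {k: k * k for k in range(100)}
results = []
derange_query(10, 12, results.append)
derange_query(90, 91, results.append)
answer_pending(store)
assert results[0] == [(10, 100), (11, 121), (12, 144)]
```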


Together with Sam McCauley and two students, Nasim Borazjanizadeh ’23 and Christopher Chung ’22, we defined a scheduling problem to model derange queries, and we investigated the problem through a theoretical lens. We made some progress on a simplified version of the problem (scheduling the tasks offline in a write-optimized linked list, i.e., a Bε-tree with fanout 1), and Sam and Christopher made significantly more progress in Christopher's undergraduate thesis. There is much work left to explore on this problem.


Together with Jun Yuan at Pace University and Boyang Wang '21, we incorporated a limited set of derange query functionality into a working system, which Boyang documented in her master's thesis. Testing derange query performance on production workloads is our next step.




Content-Aware System Evaluation

One of my primary areas of storage expertise is deduplication—a form of compression that works by identifying instances of duplicate data and storing only one copy of that data. Standard storage evaluation tools focus on creating patterns of operations that match target workload profiles. Unfortunately, data content is outside the domain of many (most?) standard tools; they instead focus on operation sizes, durations, and locations, and on the mix of operation types (e.g., the read/write ratio). Because the behavior of a deduplicating storage system depends on the contents that are written to the system, developing standardized tools for evaluating deduplication performance is an open problem.


I am interested in finding ways to efficiently inject configurable data contents into an industry-standard benchmarking tool, such as Filebench. There are two thrusts to such a project. The first, which I've made progress on with Markus Feng '20, is to develop tools to measure the data entropy of real files on real systems; entropy measures the degree of disorganization in data, and entropy correlates with the data's compression ratio. Using the tools we've developed, we can better understand what features to emulate in the data that we emit in our benchmarking dataflows. The second thrust is to modify a tool like Filebench to produce a data stream with a specified entropy. In my preliminary experiments, I found that naively modifying Filebench introduced overhead even for systems that did not depend on data contents; in other words, the very act of producing the data interfered with performance, regardless of whether the system cared about the contents. Preliminary work on offloading the data production task to a GPU was promising, and progress on this project is ongoing. Please reach out if you are interested in benchmarking or deduplication; this project is a great candidate for a Williams undergraduate.
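
As a concrete illustration of the first thrust, the sketch below computes the Shannon entropy of a data buffer's byte distribution; it is a simplification rather than the exact tool we built, but it shows the kind of signal we extract, since low-entropy data tends to compress (and deduplicate) well.

```python
# A simplified sketch of entropy measurement (not the exact tool we built):
# compute the Shannon entropy of a byte distribution, in bits per byte.
# Low-entropy data is highly ordered and tends to compress well; entropy near
# 8 bits/byte indicates data that is effectively incompressible.

import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    # Shannon entropy of the byte distribution, in bits per byte (0 to 8).
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(byte_entropy(b"aaaaaaaa"))          # 0.0: perfectly ordered
print(byte_entropy(bytes(range(256))))    # 8.0: uniform over all byte values
print(byte_entropy(os.urandom(1 << 20)))  # ~8.0: random, incompressible data
```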