Final Project

CSCI 333 : Storage Systems

Spring 2021

Overview

The goal of the final project is to give us the chance to “dive deep” on a topic that we find interesting. As a secondary goal, a final project is a chance to simulate the experience of doing “real systems research”.

There are several ways that we can produce and share knowledge, and I hope that you choose the model that best fits your learning goals. For our final projects, I would like us to choose from among the following:

measurement/analysis,
knowledge synthesis/summary of knowledge,
data collection and presentation, and/or
implementation

Here is a brief explanation of what each of those options might look like:

Measurement/analysis: As computer scientists, we should all practice the scientific method: ask a question, perform background research, form a hypothesis, test the hypothesis, draw conclusions, and communicate the results. If you wish to do measurement/analysis, your final project would start with us developing a hypothesis. Once we can articulate a hypothesis, we can design an experiment to test that hypothesis, or we could evaluate an algorithm or a system’s asymptotic performance using one of the performance models that we’ve discussed (DAM or affine models) and check whether the model predicts real-world performance. Although a complex set of tests may be too difficult or time-consuming for our remaining time, there are definitely options: we can make a small change to an existing system and test its implications, or we could evaluate important performance measures of an existing system under different parameter settings/inputs. A great outcome doesn’t require fixing a problem; often simply identifying an opportunity is incredibly valuable.
Knowledge synthesis/summary of knowledge: We could read and learn about a topic, and present what we’ve learned in an accessible/creative way. There are traditional forms of disseminating knowledge; the USENIX organization provides templates for formatting research papers using LaTeX or MS Word. We have also read at least one blog post this semester (or we will soon: Ben Stopford’s entry on LSM-trees), and you’ve watched many videos of PowerPoint presentations on different storage topics. If you wish to do knowledge synthesis, your final project would start with finding a topic area and identifying a set of sources to read, selecting a presentation format (paper, web posting, short presentation, podcast, or interview), and outlining a plan for your document.
Data collection and presentation: In our unit on file system aging, we saw one paper that documented a long-term data collection effort, an analysis of that data, and the construction of a system that was then verified using that data analysis. We don’t have time for a longitudinal study or time to build a full system, but we can collect data that tells us something about the state of the world right now. If you wish to do data collection and summary/presentation, your final project would start with us discussing a data collection task, deciding upon a “data format”, and picking a set of systems on which to collect the data.
Implementation: Finally, you could implement something. This type of project would be like a traditional lab. You would not need to write a paper, but your README would need to clearly describe what you’ve done and document your work.

The final project is your chance to use the knowledge, tools, and techniques that we have developed this semester to explore a topic that interests you. Your final project should be about the size and scope of a lab/5-page paper/7-minute presentation. However, you have the opportunity to define the project parameters, or you may choose to use or adapt one of the suggested projects below that fits your interests/goals.

Timelines

There is not much time left in the semester, so by Saturday, May 1 at noon, you should email me a project proposal. I will give feedback on it by our conference meeting on Tuesday, May 4. The final deliverables should be completed by Sunday, May 23.

Proposals

You must propose a project by Saturday, May 1 at noon. Proposals should be submitted using this template, with one proposal per group (1-3 students). When you submit your proposal, please copy all group members on the email.
I will review your proposals and return feedback to you before conference. If possible, we should brainstorm together over email, Slack, Zoom, and during Thursday’s conference.
Your proposal needs to be detailed enough that it is clear to all of us what you plan to do. It should include a list of group members (if any), the topic, the format of your deliverables, and a “definition of success”. This is important so we can agree upon a project scope.
I want this project to be as fun and low-stress as a project can be: the format is flexible, the topic is flexible, and it will be evaluated using the criteria that we collectively define. If the project doesn’t sound fun to you, don’t propose it! We can come up with something fun together. If nothing sounds fun to you right now, we should talk about that too, and we can figure something out.

Final Project

The project deliverables should be submitted on or before Sunday, May 23 at 5pm.

Guidelines

Final project guidelines:

Projects may be completed in groups of 1-3 students
- The project scope should scale with the size of the group (I expect a team of 3 students to cover a little bit more ground than 2 students, but it is not a linear relationship)
You may define your own project, or you may choose/modify a project idea from the examples below.
Depending on your project, your “deliverables” will vary. However, I will make a GitHub repository for each group, and your group will commit your final project resources to that repository by the due date.

Sample Project Ideas

This is a non-exhaustive list of project ideas. Some of them have been inspired by topics in this course, and some of them have been projects from computer science courses elsewhere. I suggest that you use them for inspiration, even if you decide to design your own project: techniques or suggestions might be relevant.

Measurement/Analysis

Hardware measurement and benchmark design from an OS course at UNC
Recreate the evaluation of a published research paper, and either confirm or dispute their results. Good candidates for this type of project are papers that have published their benchmarks and have provided clear descriptions of their methodology. Papers that measure the state of the world at some point in time are also good candidates, if you want to see whether the world has changed since then.
Recreate the “Bandwidth vs. I/O size curve” that I used to motivate seek costs on hard drives (and later SSDs), but this time use a “cloud service” as your “disk”. The goal would be to answer the question: “What is the natural transfer size for different cloud storage services?”
Download the code for a filter implementation (Bloom, cuckoo, or quotient filter), and experimentally verify that the parameter settings produce the desired false positive rates.
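As a starting point for the bandwidth vs. I/O size idea above, here is a minimal sketch of the measurement loop. It is only an illustration under several assumptions: it issues random reads against a local scratch file rather than a cloud service, the names `make_scratch_file` and `measure_bandwidth` are hypothetical, and it makes no attempt to defeat the OS page cache, so the numbers it reports on a freshly written file will mostly reflect memory speed. For a cloud service, the `seek`/`read` pair would be replaced by whatever ranged-read call that service provides.

```python
import os
import random
import tempfile
import time


def make_scratch_file(size_bytes):
    """Create a temporary file of random bytes and return its path (hypothetical helper)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(size_bytes))
    return tmp.name


def measure_bandwidth(path, io_sizes, ios_per_size=64, seed=0):
    """Return {io_size: bytes_per_second} for random reads of each I/O size.

    Assumption: reads hit a local file; the page cache is NOT bypassed.
    """
    rng = random.Random(seed)
    file_size = os.path.getsize(path)
    results = {}
    for io_size in io_sizes:
        if io_size > file_size:
            continue  # cannot issue an I/O larger than the file
        # buffering=0 avoids Python-level buffering so each read is one request.
        with open(path, "rb", buffering=0) as f:
            bytes_read = 0
            start = time.perf_counter()
            for _ in range(ios_per_size):
                f.seek(rng.randrange(0, file_size - io_size + 1))
                bytes_read += len(f.read(io_size))
            elapsed = time.perf_counter() - start
        results[io_size] = bytes_read / max(elapsed, 1e-9)
    return results
```

Plotting the returned bandwidths against the I/O sizes on a log-scaled x-axis would give the shape of the curve; where it flattens out is one reasonable reading of the "natural transfer size".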

Knowledge Synthesis/Summary of Knowledge

A survey of the challenges of building systems that are compliant with data privacy regulations. For example:
- What is the GDPR and what are its implications at big companies like Facebook/Google?
- What are the challenges of building a HIPAA-compliant storage system at a hospital, when doctors in different departments and different hospitals may want to access a patient’s data? What approaches do existing systems take?
- What are the challenges for designing a system like MOSS, one that relies on collecting large user data sets for plagiarism detection, to ensure that it is FERPA-compliant?
A survey of crash recovery techniques, like write-ahead logging, transactions, and fsck, and a discussion of which systems use which approaches and why.
A survey of different LSM-tree designs and techniques. This is a topic that I have strong thoughts about, and would love to share interesting papers.
A presentation/notes for one of the textbook chapters that we didn’t cover in class
A case study on how git works. The internals (Ch 10) are quite interesting! You could also compare git against other version control systems and discuss their pros and cons.

Straight Implementation

Implement a Bloom filter, and verify its properties (e.g., false positive rate w.r.t. k and m)
Implement a compression algorithm from scratch.
Extend your FUSE file system in one or more interesting ways, such as:
- Add encryption and/or compression (this can be surprisingly simple to do, and it can be scaled up or scaled down depending on your appetite for complexity)
- Add some form of support for data caching
Implement a new FUSE file system that does something interesting or fun. I would imagine these would be simpler and more in line with modifying the fuse_xmp pass-through file system template. For example:
- Musical FUSE: writes to files proceed as normal, but reads use an external python library to “play” the contents of the files.
- A pseudo filesystem for printing PDF documents. One idea is to have two directories: one for writing single-sided and one for double-sided. There could be a single pseudo-file called “print”, and when I use the “write” system call to write a filename to “print”, the computer would print the contents with the appropriate sided-ness to the default printer (the printer could be specified as a parameter during mount). This would be spiritually similar to the hello-FUSE lab.
- A FUSE file system that operates as a pass-through file system, but issues git commits/pushes/pulls to keep your contents synced with a remote repository. This could be an interesting way to collaborate on a project. A challenge here would be handling merge conflicts, but this project seems both straightforward and fun!
- Implement file-level compression on a “pass-through” style FS (i.e., you read and write using an existing file system, but you compress the file contents)
Implement hardware conditioning scripts, and quantify the performance differences between fresh and aged devices. Here is a link to the SNIA document with details.
Implement a variant of rsync
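To give a sense of scale for the Bloom filter idea in the list above, here is a minimal sketch of a filter with m bits and k hash functions, plus a comparison of the measured false positive rate against the standard approximation (1 - e^(-kn/m))^k for n inserted keys. Several choices here are assumptions made for illustration: the class and function names are hypothetical, keys are strings, and the k bit positions are derived by double hashing from one SHA-256 digest, which is one common approach and not necessarily the one you would choose for your project.

```python
import hashlib
import math


class BloomFilter:
    """A Bloom filter with m bits and k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Assumption: double hashing (h1 + i*h2) mod m from a single digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def expected_fpr(m, k, n):
    """Standard approximation of the false positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


def measured_fpr(m, k, n, trials=20000):
    """Insert n keys, then probe with keys that were never inserted."""
    bf = BloomFilter(m, k)
    for i in range(n):
        bf.add(f"member-{i}")
    false_positives = sum(f"probe-{i}" in bf for i in range(trials))
    return false_positives / trials
```

Sweeping k for a fixed m and n, and printing `measured_fpr` next to `expected_fpr`, is one way to check the false positive rate w.r.t. k and m.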

Data Collection

Collect summary statistics about modern storage systems. Things like compressibility, directory structure, file sizes, deduplication ratios, etc. are interesting to practitioners.
Measure the fragmentation on the lab systems or your laptop using the filefrag tool, and do some analysis of the data (average fragmentation levels, fragmentation by file type, fragmentation by file size, etc.). What patterns do you see? How might these patterns result in recommendations for data management?
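For the filefrag idea, the analysis could start from something like the sketch below. It assumes the tool's non-verbose output, one line per file of the form `path: N extents found`, and groups the extent counts by file suffix. The function names are hypothetical, and `run_filefrag` assumes a Linux system with `filefrag` installed; depending on the system it may need elevated privileges.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Assumed line format: "<path>: <N> extent(s) found"
_EXTENTS = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def parse_filefrag(output):
    """Map each path in filefrag's output to its extent count."""
    counts = {}
    for line in output.splitlines():
        match = _EXTENTS.match(line.strip())
        if match:
            counts[match.group("path")] = int(match.group("count"))
    return counts


def mean_extents_by_suffix(counts):
    """Average extent count per file suffix ("(none)" if there is no suffix)."""
    groups = defaultdict(list)
    for path, extents in counts.items():
        groups[Path(path).suffix or "(none)"].append(extents)
    return {suffix: sum(vals) / len(vals) for suffix, vals in groups.items()}


def run_filefrag(paths):
    """Run filefrag on the given paths and parse whatever it prints."""
    result = subprocess.run(
        ["filefrag", *map(str, paths)],
        capture_output=True,
        text=True,
        check=False,  # unreadable files should not abort the whole run
    )
    return parse_filefrag(result.stdout)
```

The same grouping approach extends to file size buckets: record each file's size alongside its extent count and group on the bucket instead of the suffix.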