Spring 2021
The goal of the final project is to give us the chance to “dive deep” on a topic that we find interesting. As a secondary goal, a final project is a chance to simulate the experience of doing “real systems research”.
There are several ways that we can produce and share knowledge, and I hope that you choose the model that best fits your learning goals. For our final projects, I would like us to choose from among the following: measurement/analysis, knowledge synthesis, data collection and presentation, or implementation.
Here is a brief explanation of what each of those options might look like:
Measurement/analysis: As computer scientists, we should all practice the scientific method: ask a question, perform background research, form a hypothesis, test the hypothesis, draw conclusions, and communicate the results. If you wish to do measurement/analysis, your final project would start with us developing a hypothesis. Once we can articulate a hypothesis, we can design an experiment to test it. Alternatively, we could evaluate an algorithm’s or a system’s asymptotic performance using one of the performance models that we’ve discussed (the DAM or affine models) and check whether the model predicts real-world performance. Although a complex set of tests may be too difficult or time-consuming for our remaining time, there are definitely options: we can make a small change to an existing system and test its implications, or we could evaluate important performance measures of an existing system under different parameter settings/inputs. A great outcome doesn’t require fixing a problem; often simply identifying an opportunity is incredibly valuable.
Knowledge synthesis/summary of knowledge: We could read and learn about a topic, and present what we’ve learned in an accessible/creative way. There are traditional forms of disseminating knowledge: for example, the USENIX organization provides templates for formatting research papers using LaTeX or MS Word. We have also read at least one blog post this semester (or we will soon: Ben Stopford’s entry on LSM-trees), and you’ve watched many videos of PowerPoint presentations on different storage topics. If you wish to do knowledge synthesis, your final project would start with finding a topic area, identifying a set of sources to read, selecting a presentation format (paper, web posting, short presentation, podcast, or interview), and outlining a plan for your document.
Data collection and presentation: In our unit on file system aging, we saw one paper that documented a long-term data collection effort, an analysis of that data, and the construction of a system that was then verified using that data analysis. We don’t have time for a longitudinal study or time to build a full system, but we can collect data that tells us something about the state of the world right now. If you wish to do data collection and summary/presentation, your final project would start with us discussing a data collection task, deciding upon a “data format”, and picking a set of systems on which to collect the data.
Implementation: Finally, you could implement something. This type of project would be like a traditional lab. You would not need to write a paper, but your README would need to clearly describe what you’ve done and document your work.
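For the measurement/analysis option, the affine model mentioned above can be turned into a quick prediction to compare against measurements. The sketch below assumes the model's usual form, in which each I/O pays a fixed setup cost plus a per-byte transfer cost. The function names and the 10 ms / 100 MB/s parameters are hypothetical placeholders, not measurements of any real device.

```python
def affine_io_time(size_bytes, setup_s, bandwidth_bps):
    """Affine model: fixed setup cost plus a per-byte transfer cost."""
    return setup_s + size_bytes / bandwidth_bps


def effective_bandwidth(size_bytes, setup_s, bandwidth_bps):
    """Bytes per second actually achieved for I/Os of the given size."""
    return size_bytes / affine_io_time(size_bytes, setup_s, bandwidth_bps)


def half_bandwidth_size(setup_s, bandwidth_bps):
    """I/O size at which setup time equals transfer time.

    At this size the device delivers half of its peak bandwidth.
    """
    return setup_s * bandwidth_bps


# Hypothetical device parameters (assumptions for illustration only).
SETUP_S = 0.010          # 10 ms per I/O
BANDWIDTH_BPS = 100e6    # 100 MB/s sequential transfer rate

for exponent in range(12, 28, 3):
    size = 2 ** exponent
    predicted = effective_bandwidth(size, SETUP_S, BANDWIDTH_BPS)
    print(f"{size:>10} bytes -> {predicted / 1e6:8.2f} MB/s predicted")
```

A measurement project could plot these predictions next to measured throughput for the same I/O sizes and look at where the two curves diverge.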
The final project is your chance to use the knowledge, tools, and techniques that we have developed this semester to explore a topic that interests you. Your final project should be about the size and scope of a lab/5-page paper/7-minute presentation. However, you have the opportunity to define the project parameters, or you may choose to use or adapt one of the suggested projects below that fits your interests/goals.
There is not much time left in the semester, so by Saturday, May 1 at noon, you should email me a project proposal. I will give feedback on it by our conference meeting on Tuesday, May 4. The final deliverables should be completed by Sunday, May 23.
You must propose a project by Saturday, May 1 at noon. Proposals should be submitted using this template, with one proposal per group (1-3 students). When you submit your proposal, please copy all group members on the email.
I will review your proposals and return feedback to you before conference. If possible, we should brainstorm together over email, Slack, or Zoom, and during Thursday’s conference.
Your proposal needs to be detailed enough that it is clear to all of us what you plan to do. It should include a list of group members (if any), the topic, the format of your deliverables, and a “definition of success”. This is important so we can agree upon a project scope.
I want this project to be as fun and low-stress as a project can be: the format is flexible, the topic is flexible, and it will be evaluated using the criteria that we collectively define. If the project doesn’t sound fun to you, don’t propose it! We can come up with something fun together. If nothing sounds fun to you right now, we should talk about that too, and we can figure something out.
The project deliverables should be submitted on or before Sunday, May 23 at 5pm.
Final projects may be:
Completed in groups of 1-3 students
You may define your own project, or you may choose/modify a project idea from the examples below.
Depending on your project, your “deliverables” will vary. However, I will make a GitHub repository for each group, and your group will commit your final project resources to that repository by the due date.
This is a non-exhaustive list of project ideas. Some of them have been inspired by topics in this course, and some of them have been projects from computer science courses elsewhere. I suggest that you use them for inspiration, even if you decide to design your own project: techniques or suggestions might be relevant.
Hardware measurement and benchmark design from an OS course at UNC
Recreate the evaluation of a published research paper, and either confirm or dispute its results. Good candidates for this type of project are papers that have published their benchmarks and have provided clear descriptions of their methodology. Papers that measure the state of the world at some point in time are also good candidates if you want to see whether the world has changed since then.
Recreate the “Bandwidth vs. I/O size curve” that I used to motivate seek costs on hard drives (and later SSDs), but this time use a “cloud service” as your “disk”. The goal would be to answer the question: “What is the natural transfer size for different cloud storage services?”
Download the code for a filter implementation (Bloom, cuckoo, or quotient filter), and experimentally verify that the parameter settings produce the desired false positive rates.
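As a rough illustration of what verifying a false positive rate could look like, the sketch below compares the standard Bloom filter estimate, (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted items, against a measured rate. The `BloomFilter` class here is a hypothetical stand-in for whatever implementation you download. Deriving the k positions from one SHA-256 digest by double hashing is a choice made for this sketch, not something the project requires.

```python
import hashlib
import math


class BloomFilter:
    """Minimal stand-in filter: m bits, k hash positions per item."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit values.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def expected_fpr(m, k, n):
    """Standard approximation of the false positive rate."""
    return (1 - math.exp(-k * n / m)) ** k


def measured_fpr(m, k, n, trials=20000):
    """Insert n items, then query `trials` items that were never inserted."""
    bf = BloomFilter(m, k)
    for i in range(n):
        bf.add(f"member-{i}")
    false_positives = sum(f"absent-{i}" in bf for i in range(trials))
    return false_positives / trials


m, k, n = 10_000, 7, 1_000
print(f"expected {expected_fpr(m, k, n):.4f}, measured {measured_fpr(m, k, n):.4f}")
```

Repeating the comparison over a range of k and m values, and with several independent query sets, would show how closely a given implementation tracks the estimate.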
A survey of the challenges of building systems that are compliant with data privacy regulations. For example:
What are the challenges of building a HIPAA compliant storage system at a hospital, when doctors in different departments and different hospitals may want to access a patient’s data? What approaches do existing systems take?
What are the challenges for designing a system like MOSS—one that relies on collecting large user data sets for plagiarism detection—to ensure that it is FERPA-compliant?
A survey of crash recovery techniques, like write-ahead logging, transactions, and fsck, and a discussion of which systems use which approaches and why.
A survey of different LSM-tree designs and techniques. This is a topic that I have strong thoughts about, and would love to share interesting papers.
A presentation/notes for one of the textbook chapters that we didn’t cover in class
A case study on how git works. The internals (Ch 10) are quite interesting! You could also compare git against other version control systems and discuss their pros and cons.
Implement a Bloom filter, and verify its properties (e.g., false positive rate w.r.t. k and m)
Implement a compression algorithm from scratch.
Extend your FUSE file system in one or more interesting ways, such as:
Implement a new FUSE file system that does something interesting or fun. I would imagine these would be simpler and more in line with modifying the fuse_xmp pass-through file system template. For example:
Implement hardware conditioning scripts, and quantify the performance differences between fresh and aged devices. Here is a link to the SNIA document with details.
Implement a variant of rsync
Collect summary statistics about modern storage systems. Things like compressibility, directory structure, file sizes, deduplication ratios, etc. are interesting to practitioners.
Measure the fragmentation on the lab systems or your laptop using the filefrag tool, and do some analysis of the data (average fragmentation levels, fragmentation by file type, fragmentation by file size, etc.). What patterns do you see? How might these patterns result in recommendations for data management?
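One possible starting point for the filefrag analysis is sketched below. It assumes filefrag's default one-line-per-file summary ("path: N extents found"). The helper names and the choice to group by file suffix are hypothetical, and the script expects filefrag to be installed and the files to be readable by the current user.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Matches filefrag's default summary line, e.g. "/tmp/a.txt: 3 extents found".
EXTENT_LINE = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found")


def parse_filefrag(output):
    """Map each path in filefrag's output to its extent count."""
    counts = {}
    for line in output.splitlines():
        match = EXTENT_LINE.match(line)
        if match:
            counts[match["path"]] = int(match["count"])
    return counts


def extents_for(paths):
    """Run filefrag on the given paths and parse its summary lines."""
    result = subprocess.run(
        ["filefrag", *map(str, paths)],
        capture_output=True,
        text=True,
        check=False,  # filefrag may fail on some files; keep what it reports
    )
    return parse_filefrag(result.stdout)


def mean_extents_by_suffix(counts):
    """Average extent count per file suffix (a rough proxy for file type)."""
    groups = defaultdict(list)
    for path, extents in counts.items():
        groups[Path(path).suffix or "(none)"].append(extents)
    return {suffix: sum(v) / len(v) for suffix, v in groups.items()}
```

For example, `mean_extents_by_suffix(extents_for(Path.home().glob("*")))` would summarize the top level of a home directory. Grouping by file size instead of suffix follows the same pattern.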