The File System API

CSCI 333 : Storage Systems

Spring 2021

Background

Before diving into this unit, it is helpful to refresh your understanding of these topics that were covered in CSCI 237:

The File System API

FS System Calls

If we focus on Linux, there are many system calls (~300), but only some of them specifically relate to the storage subsystem. The system calls discussed in OSTEP Chapter 39 are enumerated below, and they are the ones that you will likely use most in this class. You should be familiar with all of them: both how to use them (or how to look up their usage using Unix man pages) and how they affect the state of the storage system’s key data structures. Memorization is much less important than thinking about why the interface is designed the way it is.

Discussion Questions (syscalls)

Types of Identifiers

There are multiple ways to refer to files, and each identifier has its own advantages and disadvantages. You should be familiar with the types of identifiers that are passed to each FS-related system call, and why that particular identifier is used.
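As a rough illustration of how different identifiers can name the same underlying object, the sketch below looks up one file first by pathname (`stat`) and then by file descriptor (`fstat`) and compares the resulting (device, inode number) pairs. The helper name `identifiers_for` and the file name `example.txt` are made up for this example.

```python
import os
import tempfile

def identifiers_for(path):
    """Return the (device, inode) pair seen via a pathname and via a descriptor."""
    by_path = os.stat(path)           # stat(2): the file is named by a pathname
    fd = os.open(path, os.O_RDONLY)   # open(2): pathname in, file descriptor out
    try:
        by_fd = os.fstat(fd)          # fstat(2): the file is named by a descriptor
    finally:
        os.close(fd)
    return (by_path.st_dev, by_path.st_ino), (by_fd.st_dev, by_fd.st_ino)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "example.txt")   # hypothetical file name
    with open(path, "w") as f:
        f.write("hello\n")
    via_path, via_fd = identifiers_for(path)
    print(via_path == via_fd)   # True: both identifiers lead to the same inode
```

The pathname is resolved through the directory tree on every call, while the descriptor only has meaning inside the process that opened it.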

What logical object does each of the following types of identifiers refer to?

Discussion Questions (identifiers)

files (the data structure)

Each process has a private table that maps file descriptors (per-process integer identifiers) to file data structures. We should therefore try to be precise when we use the term file: colloquially, “file” means something related to, but distinct from, the file data structure that is part of the file system API in the OS. Unfortunately, always being precise is difficult, so you will often need to rely on context to determine which meaning of the word file is intended.
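One way to see that a file descriptor is only an index that points at a file data structure is to duplicate a descriptor. In this minimal sketch (assuming a POSIX system; the file name `data.txt` is made up), `dup` creates a second descriptor that refers to the same file structure, so the two descriptors share one offset.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"abcdefgh")

    fd1 = os.open(path, os.O_RDONLY)
    fd2 = os.dup(fd1)            # new table entry, same file data structure

    first = os.read(fd1, 3)      # advances the shared offset to 3
    second = os.read(fd2, 3)     # continues from offset 3, not from 0

    os.close(fd1)
    os.close(fd2)

print(first, second)   # b'abc' b'def'
```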

Discussion Questions (files)

directories

Directories do not store “data” in the typical sense. A directory is a particular type of file whose contents map names (individual pathname components) to inode numbers.
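The name-to-inode mapping can be observed from user space. The sketch below builds a small throwaway directory and reads its entries with `os.scandir`, collecting each entry's name and inode number; the entry names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Populate the directory with two regular files and one subdirectory.
    for name in ("a.txt", "b.txt"):
        open(os.path.join(d, name), "w").close()
    os.mkdir(os.path.join(d, "sub"))

    # Each directory entry pairs a name with an inode number.
    with os.scandir(d) as entries:
        mapping = {entry.name: entry.inode() for entry in entries}

    # Cross-check each entry against lstat on the full pathname.
    consistent = all(
        os.lstat(os.path.join(d, name)).st_ino == ino
        for name, ino in mapping.items()
    )

for name, ino in sorted(mapping.items()):
    print(name, ino)
```

Note that the directory stores only the last component of each pathname; the full path is formed by walking one directory at a time.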

Discussion Questions (directories)

open

If the process has sufficient permissions to access a file, the open() system call creates an entry in the process’s file table so that the process can interact with the file. The file descriptor returned by open is an index into that table. The same “colloquial file” can be opened multiple times, and each file structure instance keeps helpful state about the process’s interactions with that “colloquial file”, including a current offset and the “access mode” (e.g., read-only).
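The per-open state described above can be sketched as follows, assuming a POSIX system (the file name is made up). The same file is opened twice: each open gets its own offset, and a write through a read-only descriptor is rejected because of the recorded access mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"abcdefgh")

    fd_a = os.open(path, os.O_RDONLY)    # first file structure
    fd_b = os.open(path, os.O_RDONLY)    # second, independent file structure

    read_a = os.read(fd_a, 3)            # moves only fd_a's offset
    read_b = os.read(fd_b, 3)            # fd_b still starts at offset 0

    try:
        os.write(fd_a, b"x")             # access mode is read-only
        write_rejected = False
    except OSError:
        write_rejected = True

    os.close(fd_a)
    os.close(fd_b)

print(read_a, read_b, write_rejected)   # b'abc' b'abc' True
```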

Discussion Questions (open)

Indirection is a very powerful tool. There are two useful types of indirection provided by links.
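The sketch below assumes that the two types of links meant here are hard links and symbolic links, and that the underlying file system supports both. A hard link is a second name for the same inode, while a symbolic link is a separate file that stores a pathname. All file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    with open(target, "w") as f:
        f.write("data")

    os.link(target, hard)      # hard link: another name for the same inode
    os.symlink(target, soft)   # symbolic link: a new file holding a pathname

    same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
    link_count = os.stat(target).st_nlink
    symlink_has_own_inode = os.lstat(soft).st_ino != os.stat(target).st_ino

    os.unlink(target)          # remove the original name only

    with open(hard) as f:
        hard_contents = f.read()             # inode is still reachable
    symlink_dangles = not os.path.exists(soft)   # stored pathname no longer resolves

print(same_inode, link_count, symlink_has_own_inode, hard_contents, symlink_dangles)
```

After the original name is unlinked, the data survives through the hard link, but the symbolic link is left pointing at a name that no longer exists.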

File Systems and Trees

Directories give the file system namespace a hierarchical structure. This gives a powerful way to express relationships between objects. These relationships are often used by file systems to influence their low-level data placement/organization policies.

We use the mount and umount system calls to create a unified namespace from a set of independent file systems: mount takes the root of a file system and attaches it to a directory in the global namespace. Unmounting a file system disconnects the root of that file system from the namespace, making that file system’s directory tree unreachable.

Discussion Questions (trees)

fsync

Caching is an important tool for improving file system performance. Yet caching data exposes the system to data loss: what if the machine crashes before the cached data is written? It is important that applications have a way to force cached data to stable storage, which is what fsync provides, so that they can protect themselves from data loss and corruption.
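A minimal sketch of how an application might ask for durability, assuming Linux semantics: the data is written, `fsync` is called on the file descriptor, and then `fsync` is called on the containing directory so that the new directory entry is also flushed. The helper name `durable_write` is made up for this example.

```python
import os
import tempfile

def durable_write(path, data):
    """Write data to path and ask the OS to flush it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                  # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                 # flush the file's cached data and metadata
    finally:
        os.close(fd)

    # Assumption: on Linux, a newly created name is only durable once the
    # parent directory has been synced as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "log.txt")   # hypothetical file name
    durable_write(path, b"important record\n")
    with open(path, "rb") as f:
        contents = f.read()
```

Without the `fsync` calls, the writes could sit in the cache for some time after `write` returns, which is exactly the window in which a crash loses data.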

Discussion Questions (fsync)

rename

Renaming a file is a seemingly simple task. Yet the deeper you dive into the rename system call, the more interesting it becomes. rename is the first time we encounter the concept of atomicity. An atomic operation is one that either happens completely and all at once, or it does not happen at all. In an atomic operation, no intermediate state is ever revealed. For rename, what that means is that the file either exists at its original location or at its new location: it is never in both places and it is never “gone”.
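A common use of this property is to replace a file's contents without ever exposing a half-written version. The sketch below is one possible pattern, not the only one: write the new contents to a temporary name, flush them, then rename the temporary file over the old one. The helper `atomic_replace` and the `.tmp` suffix are made up for this example.

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace the contents of path so readers see the old or the new version."""
    tmp = path + ".tmp"              # hypothetical temporary name, same directory
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                    # push Python's buffer to the OS
        os.fsync(f.fileno())         # push the OS cache to stable storage
    os.rename(tmp, path)             # atomically swap the name to the new file

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"old settings\n")

    atomic_replace(path, b"new settings\n")

    with open(path, "rb") as f:
        contents = f.read()
    tmp_left_behind = os.path.exists(path + ".tmp")

print(contents, tmp_left_behind)   # b'new settings\n' False
```

The temporary file lives in the same directory as the target so that both names are on the same file system, which rename requires.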

Discussion Questions (rename)

lseek

The lseek system call updates a file data structure’s internal offset. This is useful for issuing non-sequential reads and writes (commonly referred to as random reads and writes, even when the operations are not random in the mathematical sense).
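As a small sketch of non-sequential access, the example below moves the offset of an open file with `lseek` and reads from several positions. The file name and contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"abcdefgh")

    fd = os.open(path, os.O_RDONLY)

    os.lseek(fd, 4, os.SEEK_SET)         # absolute: offset becomes 4
    middle = os.read(fd, 2)              # reads b"ef", offset becomes 6

    current = os.lseek(fd, 0, os.SEEK_CUR)   # relative move of 0 reports the offset
    size = os.lseek(fd, 0, os.SEEK_END)      # seeking to the end reports the size

    os.lseek(fd, -2, os.SEEK_END)        # two bytes before the end
    tail = os.read(fd, 2)

    os.close(fd)

print(middle, current, size, tail)   # b'ef' 6 8 b'gh'
```

Note that `lseek` itself performs no I/O; it only changes the offset stored in the file data structure, and the next read or write uses that offset.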

Discussion Questions (lseek)