Assignment 1

CSCI 333 : Storage Systems

Spring 2021

Learning Objectives

This assignment serves several purposes, and as a result, it may pose different challenges to each of us. After completing the tasks in this assignment, we will:

Assignment Overview

In this assignment, we will be implementing our own versions of important Unix utilities. These are useful utilities that we should all familiarize ourselves with for use in our everyday programming lives, but they are also utilities that may come in handy if you want to analyze/benchmark the performance of real systems (one possibility for a final project may be to extend some of these utilities in interesting/useful ways). Your current task is to write some re-imagined versions of the

programs.

Along the way, we will also become familiar with writing tests to verify that our programs work as expected. Comprehensive testing is incredibly important when writing complex programs: although we often work on small independent features, many system components use others as building blocks, and they depend upon those other components’ correctness. We can’t build an effective house when the foundation has cracks, and it is much harder to fix a problem after the fact.

Assignment Logistics

For this assignment, you will submit your own versions of all code. Each student will be given a private repository under the Williams-CS GitHub organization, and only you will have read/write access to that repository. You should be committing your code to this repository as you make progress. I highly recommend that you commit code early and often: it will only help because the teaching staff can view your code and easily answer questions using the GitHub interface.

Collaboration

Although you must write all code yourself, you are not expected to work in isolation. You are permitted to discuss the assignment with classmates under the following guidelines.

You may discuss:

Starter Code

There is no “starter code” for this assignment, but you will find empty .c files that you will fill in with your assignment solutions. In addition to the four empty .c files, each repository will contain a README.md file with additional links and assignment details, and an Eval.md file that you will complete as described below.

Submission

You will submit five individual C files (they should end in the .c extension), one for each operation/task. Each file should have a main function that, when run, performs the behavior described for that task. In addition, there should be a corresponding “test script” for each C program. This will likely be a “bash script”, but there are other approaches you may wish to explore (e.g., a python script). Your repository’s README.md is the place to describe how to run your programs and your program tests.

In addition to your code, you should complete a self-evaluation in Eval.md that reflects on your assignment. More details are included at the end of this document. If you have any questions about this part of the assignment or how it will be used, please ask.

Assignment Tasks

(Ongoing Task) Build familiarity with libc

There are several C standard library (often called libc) routines that will be particularly helpful for implementing the programs in this assignment. All C code is automatically linked with “the C standard library”, which is full of useful functions. You can learn more about the C library here.

There is a lot of code at your disposal thanks to the C library, so at various points you will need to consult the documentation. Sometimes documentation alone is not enough to understand the implementation details, and looking at the source code fills in the gaps. One slightly more readable version of C library code is the musl libc project. If you ever want to explore implementation details, or perhaps even modify some libc functionality for your own purposes, I encourage you to read and play around with the musl libc source code.

(In Class) Calling System Calls

Although we will walk through this part of the assignment together in class, you should independently complete this portion of the task yourself and execute the commands as practice. We’ve included detailed instructions so that you can recreate the steps, and you may refer as often as you’d like to your notes from class.

Applications rarely call system calls directly; instead, applications often call libc library functions that internally package arguments and correctly invoke the actual system call’s “magic incantation” (each OS may define its own protocols for invoking system calls), which ultimately passes control to the operating system kernel to perform the requested task on the application’s behalf. In this first task, we will use the stat() libc library function to ask the operating system for a single file’s details, and then we will display those details similarly to the way that the Unix stat utility presents them.

For this task (and many others like it), our first inclination should be to read the documentation. In particular, we want to understand the behavior of the stat() routine. To do this, open a terminal and type:

 $ man stat

This command immediately opens the manual page inside the terminal, but don’t fret: when you exit the manual, the terminal state that you just left will be restored.

On Ubuntu 20.04, the top of my terminal reads:

STAT(1)               User Commands               STAT(1)

NAME
       stat - display file or file system status

SYNOPSIS
       stat [OPTION]... FILE...

DESCRIPTION
       Display file or file system status.

...

ASIDE: Note that on macOS, the man page is slightly different:

STAT(1)          BSD General Commands Manual          STAT(1)

NAME
   readlink, stat -- display file status

SYNOPSIS
   stat [-FLnq] [-f format | -l | -r | -s | -x] [-t timefmt] [file ...]
   readlink [-n] [file ...]

DESCRIPTION
   The stat utility displays information ...

This is important to realize because, although we often think of “Unix-like” operating systems as interchangeable, there are both subtle and not-so-subtle differences that occasionally pop up. This is one of the reasons we spent so much time setting up a uniform environment during Lab 0.


Effectively navigating man pages takes practice, both to understand the format and to navigate the interface. The command `man stat` at the command line will open up the manual page in the same interface as the less command. Use the up/down arrows to navigate and type the single letter q to quit. In fact, I suggest using the command `man less` to explore how to navigate man pages! Your time investment will quickly pay off. (Also note: many shortcuts in less are also shortcuts in emacs and at the command line, so being familiar with Unix tools and keyboard shortcuts will make you more efficient in all aspects of systems programming!)

Notice that the first line starts with “STAT(1)”. The Unix manual is broken into several sections. The 1st section describes general commands, the 2nd section describes system calls, and the 3rd section describes library functions (including libc). You can specify a section when searching the manual to ensure that you get the documentation that you want. The syntax is `man <SECTION_NUMBER> <FUNCTION_NAME>`. So although `man stat` opens a useful manual page, it isn’t the page we want just yet. Instead, we want to look at the stat() library function, which is documented in section 2 because it is a thin libc wrapper around the system call of the same name.

To open the stat() library function’s man page, execute:

$ man 2 stat

On Ubuntu 20.04, my terminal reads:

STAT(2)               Linux Programmer's Manual               STAT(2)

NAME
       stat, fstat, lstat, fstatat - get file status

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <unistd.h>

       int stat(const char *pathname, struct stat *statbuf);
       int fstat(int fd, struct stat *statbuf);
       int lstat(const char *pathname, struct stat *statbuf);

       #include <fcntl.h>        /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fstatat(int dirfd, const char *pathname,
                   struct stat *statbuf, int flags);

   Feature Test Macro Requirements for glibc (see 
                                          feature_test_macros(7)):

       lstat():
           /* glibc 2.19 and earlier */ _BSD_SOURCE
               || /* Since glibc 2.20 */ _DEFAULT_SOURCE
               || _XOPEN_SOURCE >= 500
               || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L

       fstatat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE

DESCRIPTION

       These functions return information about a file, in the 
       buffer pointed to by statbuf.  No permissions are required
       on the file itself, but—in the case of stat(), fstatat(), and
       lstat()—execute (search) permission is required on all of the
       directories in pathname that lead to the file.

   ...

We can see that the STAT(2) manual page contains entries for multiple related library functions, including fstat(), lstat(), fstatat(), and stat(). The SYNOPSIS section shows the parameters and return types for each. We also see that some C library header files (sys/stat.h, sys/types.h, and unistd.h) must be included in order to call any of these functions. There are many other useful sections, including DESCRIPTION, which gives high-level information about each function, and RETURN VALUE, which explains how to interpret the various return values (this is particularly important since many functions we will use have values to signal complete AND partial success as well as a variety of failure modes). The SEE ALSO section often helps when the function is almost what you’re looking for, but not quite the solution you need.

Scroll down and skim the various sections to get a sense for the information they contain. Manual pages have a relatively standardized format (most follow the same conventions), so this exploration will definitely pay dividends later.

A “simple” stat program

Let’s use the information from the manual page to help us write a program. We’ll use this program for two things.

  1. We will get some practice writing C programs that use libc functions, and
  2. We will explore how to write a short test of correctness.

In order to do this, we will create a specification that is somewhat contrived and elaborate, but we’ll keep an eye towards exploring concepts we’ve encountered in our first unit.

We will write a short utility that takes a single argument (a file’s path), calls stat(), and prints specific information about the file specified. In particular, we will print the file’s name, inode number, link count, and size. These are pretty important fields, but they are a subset of the total information that the stat() library function/system call gives. Checking the man page shows how much information there actually is…

Note: to round out a real spec, we’d need to define the expected output format—on both success and failure—so that we can write our tests to verify we conform to the spec exactly. We’ll do that for later programs.

We start by creating a file called my-stat.c and including the following code:

#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

void usage() {
    printf("Please provide a pathname as input.\n");
    printf("Example usage:\n");
    printf("\t./my-stat stat.c\n");
}

/**
 * Function that calls the stat library function on a single path, and
 * if that path exists, outputs select file information.
 */
int do_stat(char *path) {
    int ret;

    // We allocate a `struct stat` on the stack for two reasons
    //   1) we only need to access the contents inside this function
    //   2) when we return from do_stat, the memory is automatically freed
    struct stat statbuf;

    // we call stat on our path, and pass the address of our struct stat
    // variable so that the function can fill in the information it needs
    // This is a very common pattern in systems code: the thing we "return"
    // is a status code (success/failure), and the relevant information is
    // written into some structure in memory
    ret = stat(path, &statbuf);
    if (ret < 0) {
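        int saved_errno = errno;  // perror() may itself modify errno
        perror("stat error");     // perror() appends ": <reason>" for us
        return saved_errno;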
    }

    // We print our information in the same format that the `stat` program
    // does. This is a contrived example; we just do this to make writing
    // our "correctness test" easier.
    printf("name:\t%s\n", path);
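    // The casts keep each format specifier matched to its argument, even on
    // platforms where these types have a different width or signedness.
    printf("ino:\t%lu\n", (unsigned long) statbuf.st_ino);
    printf("nlink:\t%lu\n", (unsigned long) statbuf.st_nlink);
    printf("size:\t%lld\n", (long long) statbuf.st_size);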

    return ret;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        usage();
        exit(EXIT_FAILURE);  // errno is not meaningful here, so don't use it
    }

    return do_stat(argv[1]);
}

Let’s take a minute to examine the code.

  1. It includes several header files, which are needed for some core functionality like printing and the stat() function itself. When including files in C, specifying files with brackets, e.g., #include <name.h>, tells the compiler to look on the “include path”, which you can expand by passing the -I flag to gcc. If you use double quotes instead, e.g., #include "name.h", then gcc first looks in the directory that contains the including source file, and falls back to the include path if the file is not found there.
  2. It has a usage() function. This is common. We often start by checking for the correct number and types of arguments, and if there is a mismatch, we stop our program and let the user know what went wrong. Often there are very specific requirements for our program output. It is important to follow the program specification exactly because scripts that use our programs may rely on the spec for their own correctness. Here, we are writing a novel program so we can do what we want; I just made the output informative.
  3. Once we parse our arguments, our actual work is done in the do_stat() function. Make your programs modular! In C, this is even more important because we have to manage our own memory, and small functions help us to reason about the lifetime/scope of variables.
  4. The local variable statbuf has type struct stat, which isn’t defined anywhere in our program. That is because struct stat is defined in <sys/stat.h>. The structure’s fields are described in the stat() manual page, so look there for more details. Notice that we only print a subset of the fields; we chose these fields because they came up in our first unit.
  5. Our program always checks the return values of (non-printf) functions. We need to catch errors early so we can reason about their source and stop our program’s execution before we propagate bugs. For example, suppose we were modifying file data; we wouldn’t want to continue our program if we realized that we gave the wrong file name: we might clobber important file contents!

Now that we’ve examined the code, try compiling your program using gcc as follows:

 $ gcc -o my-stat my-stat.c -Wall -Werror

This will make a single executable binary called my-stat, which you can then run as shown above (./my-stat <filename>). The -Wall part of the command enables a broad set of commonly useful compiler warnings (despite the name, not every possible warning), and the -Werror part says to treat all warnings as errors. In other words, you should write a program that follows all of the recommended rules.

You should be able to execute your program as follows:

 $ ./my-stat my-stat.c
 name:   my-stat.c
 ino:    2098095
 nlink:  1
 size:   1491

You may see different outputs on your own file system, but try comparing it with the output of `$ stat my-stat.c`.

A not-so-simple stat test

Now that we’ve written a program, we want to test it. In the case of our my-stat program, it is probably excessive to write a test that is as long as the program itself, but we’ll use this basic test as an example to explore the tools and techniques we’ll use for testing some of the more complex programs that we’ll write later.

The first question we want to ask ourselves is: What do we want to test?

Typically, we want to verify that our program behaves correctly when our program is run correctly, but we also care about the behavior of our program when there are errors. We’ll skip error cases for now, but we’ll handle them in our later programs.

Note: we don’t have a formal spec for my-stat, so we can’t be as thorough as we will ask you to be later in this assignment.

So for our first test, we will verify that our program behaves as expected under correct usage, and in doing so, we’ll use some techniques that we can later expand upon when we want to be thorough.

Create a file called stat-test.sh, and include the following code:

#!/bin/bash

# This test runs both the `stat` utility and our sample "stat-like" program.
# The test uses file redirection to write the outputs to files,
# which it then compares to verify that the outputs match.

# We declare and "hard-code" some variables so we can easily run our program
# without taking arguments.
FILE="my-stat.c"                  # the file we will "stat"
STAT_OUT="${FILE}.stat.log"       # where we will place "correct" output
MY_STAT_OUT="${FILE}.my-stat.log" # where we will write our "test" output

# An example bash function.
# This function is called "test_existence".
# It takes a single argument: the file whose existence to test for.
# If the file DOES exist, it exits the script with a helpful message
test_existence() {
    if [ -f "$1" ]; then
        echo "ERROR: \"$1\" exists. Please clean up before running test."
        exit 1
    fi
}

## Start our test by making sure our "output files" don't exist already.
## We do this because we don't want our test to be influenced by leftover
## data from previous runs.
test_existence "${STAT_OUT}"
test_existence "${MY_STAT_OUT}"


## We'll use the Unix stat utility as a baseline.
## We extract the relevant info and append the output to a file.
## (Note that one `>` first clears the target file, then writes to it
##   whereas two `>>` appends to the end of the target file.)
stat --printf="name:\t%n\n" ${FILE} > ${STAT_OUT}
stat --printf="ino:\t%i\n" ${FILE} >> ${STAT_OUT}
stat --printf="nlink:\t%h\n" ${FILE} >> ${STAT_OUT}
stat --printf="size:\t%s\n" ${FILE} >> ${STAT_OUT}

## Now do the same for our program: write its output to a file
./my-stat ${FILE} > ${MY_STAT_OUT}

## to check the return value of the most recent command,
## we can use the special $? variable. Note that the value
## of $? changes after every command, so we may want to
## save it in a variable so that we can refer to it
## multiple times (since using the value $? often means
## calling an expression, which results in the value of $?
## changing to reflect the return value of *that* expression)
ret=$?  # store the return value of `./my-stat ${FILE} > ${MY_STAT_OUT}`
if [ $ret -ne 0 ] ; then
    echo "return value does not match expected value."
    echo "expected: 0, received: $ret"
    exit 1
fi

## now use cmp (or diff) to compare
##    (use `man cmp` or `man diff` for more details)
## a return value of 0 means there was no difference.
## Important note: bash treats 0 as true and all other 
##    values as false!
## (We could also use the trick above to store the return value
## of cmp -s ${STAT_OUT} ${MY_STAT_OUT}, and then compare it
## against 0)
if cmp -s ${STAT_OUT} ${MY_STAT_OUT} ; then
    echo "success!"
    # Since the test succeeded, let's clean up our log files
    rm ${STAT_OUT} ${MY_STAT_OUT}
else
    echo "output does not match!"
    # leave log files around for debugging, and signal failure to the caller
    exit 1
fi

Although there are many comments throughout, we’ll emphasize some details here and fill in some gaps.

  1. First, some details about bash syntax and executable files. The first line you see is #!/bin/bash. In bash, the hash symbol (#) is the comment character. If the first line of a script starts with a “hash bang” (#!), the path that follows specifies which interpreter should run the script. In many Unix-like OSes, the /bin/ directory is where many binary executables are located when you install them, and bash is the name of our system’s default shell.

  2. In bash, variable assignment is done without spaces between the variable name, equals sign, and value.

  3. In bash, we can yield the value of a variable using $VARNAME or ${VARNAME}. I often prefer ${VARNAME} so that I can unambiguously combine the variable with other strings, like ${FILE}.stat.log, which combines the value of the ${FILE} variable with the string “.stat.log” to create a formatted file name.

  4. Function definitions do not include a set of parameters inside the (), but you can access any parameters that were passed when calling the function by using numbered variables ($1 is the first var, $2 is the second, etc.).

  5. Conditional statements take many forms. Inside test_existence, the conditional

    if [ -f "$1" ]; then

    checks whether the value of the function’s first argument ("$1") is the pathname of an existing file (using -f inside [ ] returns TRUE if the target exists and is a regular file). Here is a good bash reference with a linked table of contents that makes searching for the correct syntax for your target task easier.

  6. We call a function by specifying its name followed by 0 or more arguments. When combined with the lack of specification of arguments in the function definition, we see that there isn’t any type checking or enforcement of argument counts! This is why we don’t often write long or security-critical programs as bash scripts.

  7. We can call other programs as part of our script. For instance, the line:

    $ stat --printf="name:\t%n\n" ${FILE}

    calls the stat program and then provides some arguments, some of which may be “hard coded” and others may be variables. To specify that we want to call a program that is in the current directory, like the executable my-stat, we can specify that with “./” e.g., ./my-stat. Recall that a single dot (.) refers to the current directory.

  8. Every bash command/program yields a return value. We can use this to determine if commands succeed or not. For example, the cmp utility compares the contents of two files. If we read the manual page for cmp, we can check the return value’s meaning. Conditionally executing code in reaction to the return values of functions/utilities is a very useful strategy that we will utilize in all of our tests!

  9. The > and >> symbols let us do “output redirection”. The command on the left side is run, and its standard output is written to the file specified on the right side (which is created if the file does not yet exist). Error output is not redirected by > or >>; it still goes to the terminal unless we redirect it separately (for example, with 2>). When using one >, the right side file is truncated (its size is trimmed to 0), and the output is written to that now empty file. When using two >>, the right side file is left intact, and the output is appended at the end.

Putting it all together

Now we have two things:

  1. A C program called my-stat.c (and a compiled version called my-stat)
  2. A bash script called stat-test.sh

Let’s try them out. We’ve compiled and run my-stat.c, so let’s try the test. Type:

$ ./stat-test.sh
bash: ./stat-test.sh: Permission denied

This doesn’t work! Why is that? Let’s get meta, and examine stat-test.sh’s stat output.

$ stat stat-test.sh
  File: stat-test.sh
  Size: 2020        Blocks: 8          IO Block: 4096   regular file
Device: 811h/2065d  Inode: 2097323     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    bill)   Gid: ( 1000/    bill)
Access: 20XX-XX-XX 12:19:12.588338617 -0500
Modify: 20XX-XX-XX 12:19:04.188607737 -0500
Change: 20XX-XX-XX 12:19:04.188607737 -0500
 Birth: -

In the first “Access” line, we see the string (0664/-rw-rw-r--). The left side of the / (i.e., 0664) shows the file permissions in octal format. The last three digits represent the permissions granted to the file’s owner/user, the file’s group, and the world, respectively. The right side of the / (i.e., -rw-rw-r--) shows a more visual representation of those permissions (each octal digit on the left corresponds to three bits, and those bits represent the read (r), write (w), and executable (x) permissions). Note that there are no x’s in the permissions. We need to make our script executable!

We can change the mode (which includes the permissions) of a file using the chmod utility. How? Look at the man pages! I like to use the incremental approach, so I would type:

$ chmod u+x stat-test.sh

to give the user execute permissions. Now if I stat the file:

$ stat stat-test.sh
  File: stat-test.sh
  Size: 2020        Blocks: 8          IO Block: 4096   regular file
Device: 811h/2065d  Inode: 2097323     Links: 1
Access: (0764/-rwxrw-r--)  Uid: ( 1000/    bill)   Gid: ( 1000/    bill)
Access: 20XX-XX-XX 12:19:12.588338617 -0500
Modify: 20XX-XX-XX 12:19:04.188607737 -0500
Change: 20XX-XX-XX 12:19:04.188607737 -0500
 Birth: -

I see that I can now execute it, which I’ll do by typing:

$ ./stat-test.sh
success!

Hooray!

Wrapping up our example

Now that we’ve written and (loosely) tested a program together, the rest of the assignment asks you to replicate this process for incrementally more interesting programs. Some of these programs mimic existing Unix utilities, and for those programs, we can use strategies similar to the ones we used above in order to compare our output against a working target. But unlike in our “stat” example, we need to be sure that our programs faithfully match a program specification. This includes the behavior of correct as well as incorrect executions; in other words, your error messages should match the expected error messages too.

Program 1: my-cat

Our first program, which we will call my-cat, is a simplified version of the standard Unix program cat. At a high level, cat reads the file specified by the user and prints that file’s exact contents to standard output.

In the following example of my-cat usage, where a user wants to see the contents of the file my-stat.c, the user would pass my-stat.c as the only command-line argument, and the exact contents of my-stat.c would be printed:

$ ./my-cat my-stat.c
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

void usage() {
...the rest of my-stat.c is also printed to standard output...

Like we saw above, the “./” before the my-cat above is a Unix thing; it tells the system which directory to find my-cat in. (Recall that the “.” (single dot) is a synonym for the current working directory. Since “/” is the character used to separate different components of a pathname, “./my-cat” means look for the file my-cat inside the current directory).

To create the executable my-cat binary, you’ll be editing a single source file, my-cat.c, and writing C code that implements this simplified version of the cat utility. To compile the program, you can execute the following command:

$ gcc -o my-cat my-cat.c -Wall -Werror

This will make a single executable binary called my-cat, which you can then run as shown above (./my-cat <filename>). The -Wall part of the command enables a broad set of commonly useful compiler warnings (despite the name, not every possible warning), and the -Werror part says to treat all compiler warnings as errors. In other words, these flags force us to write a program that follows all of the recommended rules.

Helpful Commands for my-cat

In our first unit, we went over a subset of the system calls that comprise the “File System API”. There are other ways to interact with files and your file system than those system calls. (Although, when “the rubber meets the road”, the other APIs ultimately call the FS API system calls internally.)

For my-cat, we recommend using the following C library routines for file input and output: fopen(), fgets(), and fclose().

As we’ve emphatically stressed above, the first thing you should do when encountering a new function is to read the function’s manual page. (The above functions are all “library functions”, so they are in manual section 3 if you need to disambiguate.)

So, to access the man page for fopen(), you could type the following at your UNIX terminal prompt:

$ man fopen

Although you should definitely explore the man pages, we will also give a short overview of important commands here.

The fopen() function “opens” a file (similarly to open), except it doesn’t yield an integer file descriptor. Instead, the fopen() command yields a pointer to a FILE struct, which can then be passed to other library routines that read, write, etc. Most of these library functions start with an “f” (as we see with fopen(), fclose(), fgets()).

Here is a short example of fopen() usage:

FILE *fp = fopen("main.c", "r");
if (fp == NULL) {
        perror("cannot open file");  // prints our message followed by the reason
        exit(1);
}

A couple of points here:

  1. First, note that fopen() takes two arguments: the name of the file and the mode. The mode plays a role similar to the flags argument of open(), but it is a string (such as "r" or "w") rather than a set of integer flags. The mode restricts the ways that the FILE structure can be used to interact with the file. In this case, because we only wish to read the file, we pass "r" as the second argument. Read the man pages to see what other options are available. In general, you should always grant the minimum privileges necessary to perform your task; this eliminates common errors by letting you state your intent, and giving the library the information it needs to enforce your intent.

  2. Second, note the critical step of checking whether the fopen() function actually succeeded. Unlike Java, C will not throw an exception when things go wrong; rather, C expects that (in good programs, i.e., the only kind you’d want to write) you always will check whether the call succeeded. Reading the manual page tells you the details of what is returned on success (often a range of values indicate success, since partial successes may be possible) and on error (often -1, with an appropriate errno value set to indicate why the function failed); in this case, the Ubuntu 20.04 manual page says:

    Upon successful completion fopen(), fdopen(), and freopen()
    return a FILE pointer.  Otherwise, NULL is returned
    and errno is set to indicate the error.

    Thus, as the code above does, please check that fopen() does not return NULL before trying to use the FILE pointer it returns.

  3. Third, note that when the error case occurs, the program prints a message and exits with error status of 1. In Unix systems, it is traditional to return 0 upon success, and return non-zero upon failure. Here, we will use 1 to indicate failure.

Aside: if fopen() does fail, there are many possible reasons why. You can use the functions perror() and/or strerror() to print human-readable messages that describe why the error occurred; learn about those functions on your own (using … you guessed it … the man pages!), and use them in your assignment!

Once a file is opened, there are many different ways to read from it. The one we suggest that you use is fgets(). fgets() reads data from a file into a user-supplied buffer until it has read a newline, filled the buffer to the specified limit, or reached the end of the file. You could implement this same functionality using open()/read()/close(), but we’ll be using other “f” functions later, so we might as well practice.

To print out file contents, just use printf(). For example, after reading in a line with fgets() into a variable buffer, you can just print out the buffer as follows:

        printf("%s", buffer);

Note that you should not add a newline ("\n") character to the printf() format string, because then your output would diverge from the file’s true contents. Just print the exact contents of the read-in buffer (which will likely include newlines).

Finally, when you are done reading and printing, use fclose() to close the file, thus indicating you no longer need to read from it.

The Official my-cat Spec

Program 2: my-grep

The second utility you will build is called my-grep, a variant of the Unix tool grep. At a high level, grep scans through a file, line by line, trying to find a user-specified search term in each line. If a line contains the search term, that line is printed out; otherwise the line is skipped. Then grep continues searching subsequent lines.

Standard grep is quite customizable, with many command line options to tweak its behavior. Your my-grep will just implement the basic functionality described above. Here is how a user would look for the term “foo” in the file bar.txt:

$ ./my-grep foo bar.txt
this line has foo in it
so does this foolish line; do you see where?
even this line, which has barfood in it, will be printed.

This may seem like a silly thing to implement, or even a waste of time. I assure you that it is not. I (Bill) have personally used Unix grep as a system evaluation benchmark in multiple published papers. The grep utility implements a linear scan through a file (or if passed the -r flag, a recursive linear scan through a directory subtree in modified breadth-first-search order). When evaluating system performance, this is an important workload (sequential read)! In fact, extending your implementation into a “super grep”, and using your “super grep” to evaluate an existing system would be a final project that would be very exciting. Please talk to me if you are curious.

Helpful Commands for my-grep

In my-cat, we introduced the “FILE *” or “f” functions fopen() and fclose(). For my-grep, we recommend you continue to use those functions, but also explore getline() and strstr().

Why these functions? Well, you can definitely implement my-grep without using the suggested functions, and instead build upon open(), close(), read(), and strcmp(). But the above functions neatly implement a lot of helpful building blocks, so why not use them?

We suggest getline() because, when parsing a file, lines can be arbitrarily long (that is, you may see many many characters before you encounter a newline character, “\n”). Your my-grep should work as expected even with very long lines.

This is why we suggest that you look into the getline() library call (instead of fgets() or read()). What does getline() help with? Well, for starters, it takes care of most of the memory management for you: it allocates the line buffer and grows it as needed, so you never have to guess a maximum line length (you should still free() the buffer once you are done with it). Second, it identifies the division of lines in arbitrary text. Nice!
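To make the getline()/strstr() pairing concrete, here is a rough sketch of a line-matching loop. The helper name grep_stream() and its parameters are assumptions for illustration, and it is not a complete my-grep (it ignores the argument handling and error cases that the spec covers). Per the getline() man page, the buffer that getline() allocates is released with a single free() at the end.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes getline() under strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: write every line of `in` that contains `term` to `out`.
 * Returns the number of matching lines. */
long grep_stream(const char *term, FILE *in, FILE *out)
{
    char *line = NULL;  /* getline() allocates and grows this buffer */
    size_t cap = 0;     /* current capacity of `line`, managed by getline() */
    long matches = 0;

    /* getline() returns -1 at end of file or on error. */
    while (getline(&line, &cap, in) != -1) {
        if (strstr(line, term) != NULL) {
            fputs(line, out);  /* the line already ends with its newline, if any */
            matches++;
        }
    }
    free(line);  /* one free() for the buffer getline() handed us */
    return matches;
}
```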

If you’d like, you can check out the musl libc versions of those functions, see how they work, and “roll your own” versions as an optional extension. This can be very rewarding, but if you go down that road, I encourage you to get a fully working version first before replacing those core building blocks.

The Official my-grep Spec

Programs 3 and 4: my-(un)zip

The last programs you will build come in a pair: the first program, my-zip is a file compression tool, and the other program, my-unzip, is a file decompression tool.

These programs do not exactly mimic any existing Unix utilities. Thus, you will need to think of another strategy to verify your program’s correctness beyond comparing the output against an existing program. (Perhaps comparing the output against something else?)

The type of compression used in these programs is a simple form of compression called run-length encoding (RLE). RLE is quite simple: when you encounter a run of N consecutive identical characters, the compression tool will represent that run as the number N and a single instance of the character. When decompressing, the length should be read and then that many instances of the specified character should be written.
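As an illustration of the idea only, the sketch below finds the runs in an in-memory string. The struct run type and the find_runs() helper are hypothetical names, and working on a string rather than a file is a simplification; your my-zip must follow the file format and rules in the spec.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory representation of one run. */
struct run {
    uint32_t count;  /* number of consecutive repeats */
    char ch;         /* the repeated character */
};

/* Hypothetical helper: split `s` (length `len`) into runs, storing at most
 * `max` of them in `out`. Returns the number of runs stored. */
size_t find_runs(const char *s, size_t len, struct run *out, size_t max)
{
    size_t n = 0;
    size_t i = 0;

    while (i < len && n < max) {
        size_t j = i + 1;
        while (j < len && s[j] == s[i]) {
            j++;  /* extend the run while characters match */
        }
        out[n].count = (uint32_t)(j - i);
        out[n].ch = s[i];
        n++;
        i = j;    /* the next run starts where this one ended */
    }
    return n;
}
```

For the 14-character input aaaaaaaaaabbbb, this produces two runs: a count of 10 with the character a, and a count of 4 with the character b.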

my-zip Overview

Perhaps the easiest way to explain my-zip behavior is through an example. Suppose we had a file with the following contents:

    aaaaaaaaaabbbb

the my-zip tool would turn it (logically) into:

    10a4b

However, the exact format of the compressed file is quite important! Here, you will write out an unsigned 4-byte integer in binary format followed by the single char. Thus, a compressed file will consist of a series of 5-byte entries, each of which is comprised of a 4-byte unsigned integer (the run length) and a single character.

Note: In the example above, the number “10” is displayed using two digits: the characters “1” and “0”. This is not the appropriate encoding, but it is shown for demonstration purposes. Recall our unit on numeric representations in 237; how many bits does it take to represent the integer 10? The majority of the most significant bits in the 4-byte unsigned integer representation of the number “10” will be zeros. A problem is that your text editor/shell will try to interpret those bytes as text. So be careful: You will likely see “wacky” results when viewing output with text-based tools!

To write out an integer in binary format (not ASCII), you should use fwrite() (again, an “f*” function, so it uses a FILE *). Read the man page for more details. For my-zip, all output should be written to standard output (the stdout file stream, like stdin, is open and accessible by default when your program starts running).
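Here is a small sketch of writing one 5-byte entry with fwrite(). The helper name write_entry() and the FILE * parameter are assumptions; passing stdout would send the entry to standard output as the assignment requires.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write one entry (4-byte count, then 1-byte character)
 * to `out` in binary. Returns 0 on success and 1 if either write fails. */
int write_entry(FILE *out, uint32_t count, char ch)
{
    /* fwrite() copies the raw bytes of `count`; nothing is converted to text. */
    if (fwrite(&count, sizeof count, 1, out) != 1) {
        return 1;
    }
    if (fwrite(&ch, sizeof ch, 1, out) != 1) {
        return 1;
    }
    return 0;
}
```

Using uint32_t rather than plain unsigned int pins the count to exactly four bytes, which matches the entry size described above.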

Note that, because the output is not written to a persistent file, typical usage of the my-zip tool would use shell redirection in order to divert the compressed output to a file. For example, to compress the file file.txt into a (hopefully smaller) file called file.myzip, you would type:

$ ./my-zip file.txt > file.myzip

Recall that, before writing any output from the command on the left side, the “greater than” sign tells Unix to either create a new file if one does not yet exist, or truncate the contents if a file with that name already exists.

In this example, the result is that the output from my-zip is written to the file file.myzip, which may look funky since the integers are in binary format.

my-unzip Overview

The my-unzip tool implements the inverse behavior of the my-zip tool: my-unzip takes in a RLE compressed file and writes (to standard output) the expanded results. For example, to recover the contents of file.txt that were compressed and written to the file file.myzip, you would type:

$ ./my-unzip file.myzip

Your my-unzip should read in the compressed file (likely using fread(): the complement of fwrite()) and print out the uncompressed output to standard output using printf().
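The decompression loop could be sketched as follows. The helper name expand_stream() is hypothetical, and the sketch writes with fputc() to a caller-supplied stream so that it is easy to test; with stdout as the output stream it behaves like the printf() approach described above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read 5-byte entries from `in` and write the expanded
 * characters to `out`. Returns 1 if an entry is missing its character or a
 * read error occurs, and 0 otherwise. (This sketch does not detect a
 * trailing, partial count.) */
int expand_stream(FILE *in, FILE *out)
{
    uint32_t count;
    char ch;

    while (fread(&count, sizeof count, 1, in) == 1) {
        if (fread(&ch, sizeof ch, 1, in) != 1) {
            return 1;  /* a count without its character */
        }
        for (uint32_t i = 0; i < count; i++) {
            fputc(ch, out);
        }
    }
    return ferror(in) ? 1 : 0;
}
```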

The Official my-zip and my-unzip Spec

“zip” Testing Tips

There are two particularly useful tools that we can use to create and interpret outputs. The first is the program hexdump, and the second is the program /usr/bin/printf.

Let’s perform a deeper dive into the my-zip example execution that we showed above.

Suppose we had a file named ziptest.in with the following contents:

    aaaaaaaaaabbbb

Then running

   $ ./my-zip ziptest.in > ziptest.out

would direct our my-zip output to the file ziptest.out. Surprisingly, this file contains a mix of both “printable” and “non-printable” characters. This is confirmed by running cat to display the contents of this file.

   $  cat ziptest.out
   
   ab$ 

We see a blank line, then the string ab, then our prompt (since there is no newline after ab).

This is likely not what you expected the output file to contain: based on the description above, I expected the number 10 as a 4-byte integer, followed by the character a, then the number 4 as a 4-byte integer, followed by the character b. Let’s look at the file’s contents using a flexible data interpreter called hexdump. We’ll use this tool to diagnose the situation.

   $ hexdump ziptest.out
   0000000 000a 0000 0461 0000 6200
   000000a

We see the “raw” contents of the file, printed in hexadecimal using the default hexdump formatting:

This output is helpful, but the format is a bit less intuitive than we’d like. If we read through the manual page, it shows how we can use a printf-style format string to control the way the data is displayed. The following command tells hexdump to format its output in a way that matches our program’s “logic”:

   $ hexdump  -e '5/1 " %02X " "\n"' ziptest.out
    0A  00  00  00  61
    04  00  00  00  62

This output looks much better. Each line has five bytes, with each byte separated by a space. The first four bytes are our 32-bit integer count, and the last byte is our character.

You may be wondering why we don’t see the following output:

  00 00 00 0A 'a'
  00 00 00 04 'b'

The reason has to do with number formats. If we’re writing a 32-bit integer, we need to know how our machine represents integers. On an x86 platform, fread and fwrite will read/write our values in little-endian format. So the individual bytes that make up the 4-byte integer “run length” are written from the least significant byte (LSB) to the most significant byte (MSB). This is why we see 0A (which is integer value 10) as the first byte of our file, followed by three 00 values. Here are the steps:

The mystery of the final byte (the character we are encoding) is a little less obvious. To figure out the meaning of the final byte, we need to consult an ASCII table. We see that a lowercase a, in ASCII format, is represented with the decimal value 97. This means that the hexadecimal representation of a is 61 (6 × 16 + 1 = 97), which is what we see.
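If you would like to check the byte order on your own machine, a small sketch like the one below copies an integer’s in-memory bytes into an array so that you can inspect them. The helper name bytes_of() is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the in-memory bytes of `value` into `bytes`,
 * lowest address first. This is the same order in which fwrite() would
 * place them in a file. */
void bytes_of(uint32_t value, unsigned char bytes[4])
{
    memcpy(bytes, &value, sizeof value);
}
```

On a little-endian machine, calling bytes_of(10, b) leaves b holding 0A 00 00 00, matching the first four bytes of the hexdump output. The fifth byte of each entry needs no such treatment: a char is a single byte, and in ASCII 'a' is 97, or 0x61.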

So now we have one end-to-end example of an input (aaaaaaaaaabbbb) and an output (0x0A 0x00 0x00 0x00 0x61 0x04 0x00 0x00 0x00 0x62). And hopefully, we have a concrete understanding of what the expected output should be for a given input. The next task is figuring out how to generate files with arbitrary byte patterns to construct our tests.

The printf program is one way to do this, and it is both simple and expressive enough that I encourage you to use it. We can use printf to specify and write data in binary format.

   $ printf '\x0a\x00\x00\x00\x61\x04\x00\x00\x00\x62' > output.bin

The prefix “\x” lets us specify a byte in hexadecimal, and we can string together multiple bytes to create our target output.

Now we have the tools we need to create correctness tests for the my-zip utility. (And once we’ve verified the correctness of our my-zip, we can use the my-zip outputs as inputs to my-unzip. With the exception of a few cases noted in the spec, this approach seems like a good one!)

Evaluation

In your repository, you will find a file called Eval.md. In it, you should assess your my-cat.c, my-grep.c, my-zip.c, and my-unzip.c implementations based on correctness (which you should verify by writing tests that compare their output against standard Unix utilities, sample outputs, and/or reference implementations), code clarity (Did you write small modular functions that you compose to complete the program’s task? Did you sufficiently document your programs so that you could understand the code if you were to revisit it a year from now? Did you choose good variable names, consistently indent, and define your variables in the appropriate scope (e.g., using return values to communicate across functions rather than updating global variables)?), proper error-handling (did you check the return value of all non-printing functions, and handle success/failure appropriately), timeliness, and the adherence to the program specifications.

Although I am an important part of your target audience, your goal should be to convince yourself that your programs are complete and high quality. This is an important life skill that extends well beyond Williams or CS. If/when you get a job, your boss will not “grade” your work and return it to you. You will be given (or asked to write) a spec and then you must complete it. Your reputation in your company or team will be affected by the work that you contribute, and this will affect the trajectory of your career. So you should feel confident in the work you submit, and you should understand the criteria by which it will be judged.

However, this class is not just about the final product. There are often things that do not show up in your git commit history. Did you take the time to get comfortable with reading man pages? Did you spend time building your gdb skills? Did you explore bash syntax to help you write creative tests? Did you overcome any challenging bugs or situations that you are proud of? Doing this takes time, and these are investments that you should be rewarded for making. (Hopefully the promise of making your life easier down the line is a pay-off, but your efforts should also be acknowledged now). So at the end of your Eval.md, you should document your experience, including the ups and downs, and reflect on how you spent your time. Convince me and convince yourself that you spent the time to learn the material.

The format of the Eval.md that I provide is a suggestion. Ultimately, how you reflect on your Lab Assignment is up to you.

Submitting Your Work

When you have completed the lab, submit your code and evaluation using the appropriate git commands, such as:

  $ git status
  $ git add ...
  $ git commit -m "final submission"
  $ git push

Verify that your changes appear on GitHub by navigating to your private repository using the web interface. It should be available at https://github.com/williams-cs/cs333lab1-{USERNAME}. You should see all changes reflected in the various files that you submit. If not, go back and make sure you committed and pushed. I will be retrieving all lab code from GitHub, so if your changes are not visible to you on GitHub, they will not be visible to me either. I want to make sure everyone receives credit for their work!

Evolving Advice (check back for updates)

Even before writing concrete tests, you can incrementally test your programs by hand. If you think about it, a bash script is a series of commands that you could have executed at the command line. Consider running your programs and using shell redirection (>) to write the output to a temporary file. If you do the same with the standard cat and grep programs, you can compare the outputs.

The diff tool is incredibly helpful for comparing two files; it gives much more information in its output than cmp (although the yes/no nature of cmp makes it ideal for certain tests). With diff, you will quickly see whether your behavior matches exactly, and if not, you can see how your output differs.

For my-zip and my-unzip, you may wish to use diff to compare the uncompressed version of a file against the original; although that does not test every case that my-zip and my-unzip must handle (e.g., multiple files as input), a well organized implementation of a single-file-case test could be generalized to a test that invokes the multi-file cases.

Start early on my-cat.c. The remaining three programs have a similar foundation, so completing my-cat.c as early as possible will help you to create a budget/schedule for your remaining time.

Stop by my office hours or the TA help hours, and ask questions on Slack as soon as you get stuck.

Please collaborate as much as possible under the allowed guidelines. Do whatever helps you learn the material best.

You should be able to use gdb on your code. If you are unfamiliar with gdb, Google and your instructor/TA/classmates can help. This GDB info page from Harvey Mudd and this GDB quick reference from the Univ of Texas may also be very helpful. REMEMBER TO COMPILE YOUR CODE WITH -g IF YOU WANT TO RUN IT IN A DEBUGGER!!

Possible Extensions

There are many interesting extensions that you could attempt if you want to add more functionality to your programs. (Note that if you attempt these extensions, it may alter the spec, which is OK!)

Many programs interpret “flags” passed in from the command line in order to affect their behavior. The cat and grep utilities are among them.

Adding support for “flags” is something that can be done using the getopt() or getopt_long() library functions. There is even more documentation elsewhere, with examples.

I also encourage you to look at the man pages for grep in particular, and see if there are interesting features you’d like to try out. For instance:

If you attempt any extensions, be sure to document them in your Eval.md file and adjust your assessment to match your deeper engagement with the material.


Acknowledgments

The contents of this lab borrow from labs written by Remzi and Andrea Arpaci-Dusseau as part of their OSTEP course materials.