Spring 2021
This assignment serves several purposes, and as a result, it may pose different challenges to each of us. After completing the tasks in this assignment, we will:
In this assignment, we will be implementing our own versions of important Unix utilities. These are useful utilities that we should all familiarize ourselves with for use in our everyday programming lives, but they are also utilities that may come in handy if you want to analyze/benchmark the performance of real systems (one possibility for a final project may be to extend some of these utilities in interesting/useful ways). Your current task is to write some re-imagined versions of the
programs.
Along the way, we will also become familiar with writing tests to verify that our programs work as expected. Comprehensive testing is incredibly important when writing complex programs: although we often work on small independent features, many system components use others as building blocks, and they depend upon those other components’ correctness. We can’t build an effective house when the foundation has cracks, and it is much harder to fix a problem after the fact.
For this assignment, you will submit your own versions of all code. Each student will be given a private repository under the Williams-CS GitHub organization, and only you will have read/write access to that repository. You should be committing your code to this repository as you make progress. I highly recommend that you commit code early and often: it will only help because the teaching staff can view your code and easily answer questions using the GitHub interface.
Although you must write all code yourself, you are not expected to work in isolation. You are permitted to discuss the assignment with classmates under the following guidelines.
You may discuss:
man pages, make/Makefiles, Unix utilities) and how to use them
sudo may be dangerous.
you may verbally discuss any C library functions that were useful for particular tasks and the reasons you used them
you may verbally discuss general algorithms for particular tasks, or share non-persistent diagrams/pseudocode of algorithms. The litmus test of understanding a concept is that you can recreate your own copy from memory after the conversation ends.
you may verbally discuss your strategies for verifying your program’s correctness. For example, “I wrote a script that runs my my-cat program, writes the output to a file, and then uses the Unix diff utility to compare my program’s output to the Unix cat utility’s output.” would be a perfectly fine conversation to have with a classmate.
There is no “starter code” for this assignment, but you will find empty .c
files that you will fill in with your assignment solutions. In addition to the four empty .c
files, each repository will contain a README.md
file with additional links and assignment details, and an Eval.md
file that you will complete as described below.
You will submit five individual C files (they should end in the .c
extension), one for each operation/task. Each file should have a main
function that, when run, performs the behavior described for that task. In addition, there should be a corresponding “test script” for each C program. This will likely be a “bash script”, but there are other approaches you may wish to explore (e.g., a Python script). Your repository’s README.md
is the place to describe how to run your programs and your program tests.
In addition to your code, you should complete a self-evaluation in Eval.md
that reflects on your assignment. More details are included at the end of this document. If you have any questions about this part of the assignment or how it will be used, please ask.
libc
There are several C standard library (often called libc
) routines that will be particularly helpful for implementing the programs in this assignment. All C code is automatically linked with “the C standard library”, which is full of useful functions. You can learn more about the C library here.
There is a lot of code at your disposal thanks to the C library, so at various points you will need to consult the documentation. Sometimes documentation alone is not enough to understand the implementation details, and looking at the source code fills in the gaps. One slightly more readable version of C library code is the musl libc project. If you ever want to explore implementation details, or perhaps even modify some libc functionality for your own purposes, I encourage you to read and play around with the musl libc source code.
Although we will walk through this part of the assignment together in class, you should independently complete this portion of the task yourself and execute the commands as practice. We’ve included detailed instructions so that you can recreate the steps, and you may refer as often as you’d like to your notes from class.
Applications rarely call system calls directly; instead, applications often call libc library functions that internally package arguments and correctly invoke the actual system call’s “magic incantation” (each OS may define its own protocols for invoking system calls), which ultimately passes control to the operating system kernel to perform the requested task on the application’s behalf. In this first task, we will use the stat()
libc library function to ask the operating system for a single file’s details, and then we will display those details similarly to the way that Unix stat
utility presents them.
For this task (and many others like it), our first inclination should be to read the documentation. In particular, we want to understand the behavior of the stat()
routine. To do this, open a terminal and type:
$ man stat
This command immediately opens the manual page inside the terminal, but don’t fret: when you exit the manual, the terminal state that you just left will be restored.
On Ubuntu 20.04, the top of my terminal reads:
STAT(1) User Commands STAT(1)
NAME
stat - display file or file system status
SYNOPSIS
stat [OPTION]... FILE...
DESCRIPTION
Display file or file system status.
...
ASIDE: Note that on macOS, the man page is slightly different:
STAT(1) BSD General Commands Manual STAT(1)
NAME
readlink, stat -- display file status
SYNOPSIS
stat [-FLnq] [-f format | -l | -r | -s | -x] [-t timefmt] [file ...]
readlink [-n] [file ...]
DESCRIPTION
The stat utility displays information ...
This is important to realize because, although we often think of “Unix-like” operating systems as interchangeable, there are both subtle and not-so-subtle differences that occasionally pop up. This is one of the reasons we spent so much time setting up a uniform environment during Lab 0.
Effectively navigating man
pages takes practice, both to understand the format and to navigate the interface. The command `man stat`
at the command line will open up the manual page in the same interface as the less
command. Use the up/down arrows to navigate and type the single letter q
to quit. In fact, I suggest using the command `man less`
to explore how to navigate man
pages! Your time investment will quickly pay off. (Also note: many shortcuts in less
are also shortcuts in emacs
and at the command line, so being familiar with Unix tools and keyboard shortcuts will make you more efficient in all aspects of systems programming!)
Notice that the first line starts with “STAT(1)
”. The Unix manual is broken into several sections. The 1st section describes general commands, the 2nd section describes system calls, and the 3rd section describes library functions (including libc). You can specify a section when searching the manual to ensure that you get the documentation that you want. The syntax is `man <SECTION_NUMBER> <FUNCTION_NAME>`. So although `man stat`
opens a useful manual page, it isn’t the page we want just yet. Instead, we want to look at the stat()
library function.
To open the stat()
library function’s man page, execute:
$ man 2 stat
On Ubuntu 20.04, my terminal reads:
STAT(2) Linux Programmer's Manual STAT(2)
NAME
stat, fstat, lstat, fstatat - get file status
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
int stat(const char *pathname, struct stat *statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *pathname, struct stat *statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *pathname,
struct stat *statbuf, int flags);
Feature Test Macro Requirements for glibc (see
feature_test_macros(7)):
lstat():
/* glibc 2.19 and earlier */ _BSD_SOURCE
|| /* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the
buffer pointed to by statbuf. No permissions are required
on the file itself, but—in the case of stat(), fstatat(), and
lstat()—execute (search) permission is required on all of the
directories in pathname that lead to the file.
...
We can see that the STAT(2)
manual page contains entries for multiple related library functions, including fstat()
, lstat()
, fstatat()
, and stat()
. The SYNOPSIS
section shows the parameters and return types for each. We also see that some C library header files (sys/stat.h
, sys/types.h
, and unistd.h
) must be included in order to call any of these functions. There are many other useful sections, including DESCRIPTION
, which gives high-level information about each function, and RETURN VALUE
, which explains how to interpret the various return values (this is particularly important since many functions we will use have values to signal complete AND partial success as well as a variety of failure modes). The SEE ALSO
section often helps when the function is almost what you’re looking for, but not quite the solution you need.
Scroll down and skim the various sections to get a sense for the information they contain. Manual pages have a relatively standardized format (most follow the same conventions), so this exploration will definitely pay dividends later.
stat program
Let’s use the information from the manual page to help us write a program. We’ll use this program for two things.
In order to do this, we will create a specification that is somewhat contrived and elaborate, but we’ll keep an eye towards exploring concepts we’ve encountered in our first unit.
We will write a short utility that takes a single argument (a file’s path), calls stat()
, and prints specific information about the file specified. In particular, we will print the file’s name, inode number, link count, and size. These are pretty important fields, but they are a subset of the total information that the stat()
library function/system call gives. Checking the man page shows how much information there actually is…
Note: to round out a real spec, we’d need to define the expected output format—on both success and failure—so that we can write our tests to verify we conform to the spec exactly. We’ll do that for later programs.
We start by creating a file called my-stat.c
and including the following code:
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

void usage() {
    printf("Please provide a pathname as input.\n");
    printf("Example usage:\n");
    printf("\t./my-stat stat.c\n");
}

/**
 * Function that calls the stat library function on a single path, and
 * if that path exists, outputs select file information.
 */
int do_stat(char *path) {
    int ret;
    // We allocate a `struct stat` on the stack for two reasons
    // 1) we only need to access the contents inside this function
    // 2) when we return from do_stat, the memory is automatically freed
    struct stat statbuf;

    // we call stat on our path, and pass the address of our struct stat
    // variable so that the function can fill in the information it needs
    // This is a very common pattern in systems code: the thing we "return"
    // is a status code (success/failure), and the relevant information is
    // written into some structure in memory
    ret = stat(path, &statbuf);
    if (ret < 0) {
        // save errno first: later library calls may overwrite it
        int err = errno;
        // perror appends ": " and the error description to our string
        perror("stat error");
        return err;
    }

    // We print our information in the same format that the `stat` program
    // does. This is a contrived example; we just do this to make writing
    // our "correctness test" easier.
    // (The casts make the values match the %lu format on any platform.)
    printf("name:\t%s\n", path);
    printf("ino:\t%lu\n", (unsigned long) statbuf.st_ino);
    printf("nlink:\t%lu\n", (unsigned long) statbuf.st_nlink);
    printf("size:\t%lu\n", (unsigned long) statbuf.st_size);

    return ret;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        usage();
        // errno is not set by a usage error, so exit with an explicit
        // non-zero status instead of exit(errno)
        exit(EXIT_FAILURE);
    }
    return do_stat(argv[1]);
}
Let’s take a minute to examine the code.
stat() function itself. When including files in C, specifying files with brackets, e.g., #include <name.h>, tells the compiler to look on the “include path”, which you can expand by passing the -I flag to gcc. If you use quotes instead, e.g., #include "name.h", then gcc first looks for the file in the directory that contains the current source file, and then falls back to the include path.
usage() function. This is common. We often start by checking for the correct number and types of arguments, and if there is a mismatch, we stop our program and let the user know what went wrong. Often there are very specific requirements for our program output. It is important to follow the program specification exactly because scripts that use our programs may rely on the spec for their own correctness. Here, we are writing a novel program so we can do what we want; I just made the output informative.
do_stat() function. Make your programs modular! In C, this is even more important because we have to manage our own memory, and small functions help us to reason about the lifetime/scope of variables.
statbuf has type struct stat, which isn’t defined anywhere in our program. That is because struct stat is defined in <sys/stat.h>. The structure’s fields are described in the stat() manual page, so look there for more details. Notice that we only print a subset of the fields; we chose these fields because they came up in our first unit.
printf) functions. We need to catch errors early so we can reason about their source and stop our program’s execution before we propagate bugs. For example, suppose we were modifying file data; we wouldn’t want to continue our program if we realized that we gave the wrong file name: we might clobber important file contents!
Now that we’ve examined the code, try compiling your program using gcc as follows:
$ gcc -o my-stat my-stat.c -Wall -Werror
This will make a single executable binary called my-stat
, which you can then run as shown above (./my-stat <filename>
). The -Wall
part of the command says to print all warnings, and the -Werror
part says to treat all warnings as errors. In other words, you should write a program that follows all of the recommended rules.
You should be able to execute your program as follows:
$ ./my-stat my-stat.c
name: my-stat.c
ino: 2098095
nlink: 1
size: 1491
You may see different outputs on your own file system, but try comparing it with the output of `$ stat my-stat.c`
.
stat test
Now that we’ve written a program, we want to test it. In the case of our my-stat
program, it is probably excessive to write a test that is as long as the program itself, but we’ll use this basic test as an example to explore the tools and techniques we’ll use for testing some of the more complex programs that we’ll write later.
The first question we want to ask ourselves is: What do we want to test?
Typically, we want to verify that our program behaves correctly when our program is run correctly, but we also care about the behavior of our program when there are errors. We’ll skip error cases for now, but we’ll handle them in our later programs.
Note: we don’t have a formal spec for
my-stat
, so we can’t be as thorough as we will ask you to be later in this assignment.
So for our first test, we will verify that our program behaves as expected under correct usage, and in doing so, we’ll use some techniques that we can later expand upon when we want to be thorough.
Create a file called stat-test.sh
, and include the following code:
#!/bin/bash
# This test runs both the `stat` utility and our sample "stat-like" program.
# The test uses file redirection to write the outputs to files,
# which it then compares to verify that the outputs match.
# We declare and "hard-code" some variables so we can easily run our program
# without taking arguments.
FILE="my-stat.c" # the file we will "stat"
STAT_OUT="${FILE}.stat.log" # where we will place "correct" output
MY_STAT_OUT="${FILE}.my-stat.log" # where we will write our "test" output
# An example bash function.
# This function is called "test_existence".
# It takes a single argument: the file whose existence to test for.
# If the file DOES exist, it exits the script with a helpful message
test_existence() {
    if [ -f "$1" ]; then
        echo "ERROR: \"$1\" exists. Please clean up before running test."
        # exit with a non-zero status so callers can tell the test did not run
        exit 1
    fi
}
## Start our test by making sure our "output files" don't exist already.
## We do this because we don't want our test to be influenced by leftover
## data from previous runs.
test_existence "${STAT_OUT}"
test_existence "${MY_STAT_OUT}"
## We'll use the Unix stat utility as a baseline.
## We extract the relevant info and append the output to a file.
## (Note that one `>` first clears the target file, then writes to it
## whereas two `>>` appends to the end of the target file.)
stat --printf="name:\t%n\n" ${FILE} > ${STAT_OUT}
stat --printf="ino:\t%i\n" ${FILE} >> ${STAT_OUT}
stat --printf="nlink:\t%h\n" ${FILE} >> ${STAT_OUT}
stat --printf="size:\t%s\n" ${FILE} >> ${STAT_OUT}
## Now do the same for our program: write its output to a file
./my-stat ${FILE} > ${MY_STAT_OUT}
## to check the return value of the most recent command,
## we can use the special $? variable. Note that the value
## of $? changes after every command, so we may want to
## save it in a variable so that we can refer to it
## multiple times (since using the value $? often means
## calling an expression, which results in the value of $?
## changing to reflect the return value of *that* expression)
ret=$? # store the return value of `./my-stat ${FILE} > ${MY_STAT_OUT}`
if [ "$ret" -ne 0 ] ; then
    echo "return value does not match expected value."
    echo "expected: 0, received: $ret"
    exit 1
fi
## now use cmp (or diff) to compare
## (use `man cmp` or `man diff` for more details)
## a return value of 0 means there was no difference.
## Important note: bash treats 0 as true and all other
## values as false!
## (We could also use the trick above to store the return value
## of cmp -s ${STAT_OUT} ${MY_STAT_OUT}, and then compare it
## against 0)
if cmp -s ${STAT_OUT} ${MY_STAT_OUT} ; then
    echo "success!"
    # Since the test succeeded, let's clean up our log files
    rm ${STAT_OUT} ${MY_STAT_OUT}
else
    echo "output does not match!"
    # leave log files around for debugging,
    # and report the failure through our exit status
    exit 1
fi
Although there are many comments throughout, we’ll emphasize some details here and fill in some gaps.
First, some details about bash syntax and executable files. The first line you see is #!/bin/bash
. In bash, the hashtag (#
) is the comment character. If the first line starts with a “hash bang” (#!
), the following path specifies what shell to use. In many Unix-like OSes, the /bin/
directory is where many binary executables are located when you install them, and bash is the name of our system’s default shell.
In bash, variable assignment is done without spaces between the variable name, equals sign, and value.
In bash, we can yield the value of a variable using $VARNAME
or ${VARNAME}
. I often prefer ${VARNAME}
so that I can unambiguously combine the variable with other strings, like ${FILE}.stat.log
, which combines the value of the ${FILE}
variable with the string “.stat.log
” to create a formatted file name.
Function definitions do not include a set of parameters inside the ()
, but you can access any parameters that were passed when calling the function by using numbered variables ($1
is the first var, $2
is the second, etc.).
Conditional statements take many forms. Inside test_existence, the conditional
if [ -f "$1" ]; then
checks whether the value of the function’s first argument ("$1"
) is the pathname of an existing file (using -f
inside [ ]
returns TRUE
if the target exists and is a regular file). Here is a good bash reference with a linked table of contents that makes searching for the correct syntax for your target task easier.
We call a function by specifying its name followed by 0 or more arguments. When combined with the lack of specification of arguments in the function definition, we see that there isn’t any type checking or enforcement of argument counts! This is why we don’t often write long or security-critical programs as bash scripts.
We can call other programs as part of our script. For instance, the line:
$ stat --printf="name:\t%n\n" ${FILE}
calls the stat
program and then provides some arguments, some of which may be “hard coded” and others may be variables. To specify that we want to call a program that is in the current directory, like the executable my-stat
, we can specify that with “./
” e.g., ./my-stat
. Recall that a single dot (.
) refers to the current directory.
Every bash command/program yields a return value. We can use this to determine if commands succeed or not. For example, the cmp
utility compares the contents of two files. If we read the manual page for cmp
, we can check the return value’s meaning. Conditionally executing code in reaction to the return values of functions/utilities is a very useful strategy that we will utilize in all of our tests!
The > and >> symbols let us do “input/output redirection”. The command on the left side is run, and its standard output is written to the file specified on the right side (which is created if the file does not yet exist). Error output is not redirected by > or >>; it needs its own redirection, such as 2>. When using one >, the right side file is truncated (its size is trimmed to 0), and the output is written to that now empty file. When using two >>, the right side file is left intact, and the output is appended at the end.
Now we have two things:
my-stat.c (and a compiled version called my-stat)
stat-test.sh
Let’s try them out. We’ve compiled and run my-stat.c
, so let’s try the test. Type:
$ ./stat-test.sh
bash: ./stat-test.sh: Permission denied
This doesn’t work! Why is that? Let’s get meta, and examine stat-test.sh’s stat output.
$ stat stat-test.sh
File: stat-test.sh
Size: 2020 Blocks: 8 IO Block: 4096 regular file
Device: 811h/2065d Inode: 2097323 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1000/ bill) Gid: ( 1000/ bill)
Access: 20XX-XX-XX 12:19:12.588338617 -0500
Modify: 20XX-XX-XX 12:19:04.188607737 -0500
Change: 20XX-XX-XX 12:19:04.188607737 -0500
Birth: -
In the first “Access” line, we see the string (0664/-rw-rw-r--)
. The left side of the /
(i.e., 0664
) shows the file permissions in octal format. The last three digits represent the permissions granted to the file’s owner/user, the file’s group, and the world, respectively. The right side of the /
(i.e., -rw-rw-r--
) shows a more visual representation of those permissions (each octal digit on the left corresponds to three bits, and those bits represent the read (r
), write (w
), and executable (x
) permissions). Note that there are no x
’s in the permissions. We need to make our script executable!
We can change the mode (which includes the permissions) of a file using the chmod
utility. How? Look at the man pages! I like to use the incremental approach, so I would type:
$ chmod u+x stat-test.sh
to give the user execute permissions. Now if I stat the file:
$ stat stat-test.sh
File: stat-test.sh
Size: 2020 Blocks: 8 IO Block: 4096 regular file
Device: 811h/2065d Inode: 2097323 Links: 1
Access: (0764/-rwxrw-r--) Uid: ( 1000/ bill) Gid: ( 1000/ bill)
Access: 20XX-XX-XX 12:19:12.588338617 -0500
Modify: 20XX-XX-XX 12:19:04.188607737 -0500
Change: 20XX-XX-XX 12:19:04.188607737 -0500
Birth: -
I see that I can now execute it, which I’ll do by typing:
$ ./stat-test.sh
success!
Hooray!
Now that we’ve written and (loosely) tested a program together, the rest of the assignment asks you to replicate this process for incrementally more interesting programs. Some of these programs mimic existing Unix utilities, and for those programs, we can use strategies similar to those we used above in order to compare our output against a working target. But unlike in our “stat” example, we need to be sure that our programs faithfully match a program specification. This includes the behavior of correct as well as incorrect executions; in other words, your error messages should match the expected error messages too.
my-cat
Our first program, which we will call my-cat
, is a simplified version of the standard Unix program cat
. At a high level, cat
reads the file specified by the user and prints that file’s exact contents to standard output.
In the following example of my-cat
usage, where a user wants to see the contents of the file my-stat.c
, the user would pass my-stat.c
as the only command-line argument, and the exact contents of my-stat.c
would be printed:
$ ./my-cat my-stat.c
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
void usage() {
...the rest of my-stat.c is also printed to standard output...
Like we saw above, the “./
” before the my-cat
above is a Unix thing; it tells the system which directory to find my-cat
in. (Recall that the “.
” (single dot) is a synonym for the current working directory. Since “/
” is the character used to separate different components of a pathname, “./my-cat
” means look for the file my-cat
inside the current directory).
To create the executable my-cat
binary, you’ll be editing a single source file, my-cat.c
, and writing C code that implements this simplified version of the cat
utility. To compile the program, you can execute the following command:
$ gcc -o my-cat my-cat.c -Wall -Werror
This will make a single executable binary called my-cat
, which you can then run as shown above (./my-cat <filename>
). The -Wall
part of the command says to print all compiler warnings, and the -Werror
part says to treat all compiler warnings as errors. In other words, these flags force us to write a program that follows all of the recommended rules.
my-cat
In our first unit, we went over a subset of the system calls that comprise the “File System API”. There are other ways to interact with files and your file system than those system calls. (Although, when “the rubber meets the road”, the other APIs ultimately call the FS API system calls internally.)
For my-cat, we recommend using the following C library routines for file input and output:
* fopen()
* fgets(), and
* fclose()
As we’ve emphatically stressed above, the first thing you should do when encountering a new function is to read the function’s manual page. (The above functions are all “library functions”, so they are in manual section 3 if you need to disambiguate.)
So, to access the man page for fopen()
, you could type the following at your UNIX terminal prompt:
$ man fopen
Although you should definitely explore the man pages, we will also give a short overview of important commands here.
The fopen()
function “opens” a file (similarly to open
), except it doesn’t yield an integer file descriptor. Instead, the fopen()
command yields a pointer to a FILE
struct, which can then be passed to other library routines that read, write, etc. Most of these library functions start with an “f” (as we see with fopen()
, fclose()
, fgets()
).
Here is a short example of fopen()
usage:
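FILE *fp = fopen("main.c", "r");
if (fp == NULL) {
    printf("cannot open file\n");
    // perror needs a string argument; it prints that string followed by
    // a description of the error recorded in errno
    perror("fopen");
    exit(1);
}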
A couple of points here:
First, note that fopen()
takes two arguments: the name of the file and the mode. These arguments play the same roles as the pathname and flags arguments of the open()
function. The mode restricts the ways that the FILE
structure can be used to interact with the file. In this case, because we only wish to read the file, we pass "r"
as the second argument. Read the man pages to see what other options are available. In general, you should always grant the minimum privileges necessary to perform your task; this eliminates common errors by letting you state your intent, and giving C the information it needs to enforce your intent.
Second, note the critical step of checking whether the fopen()
function actually succeeded. Unlike Java, C will not throw an exception when things go wrong; rather, C expects that (in good programs, i.e., the only kind you’d want to write) you will always check whether the call succeeded. Reading the manual page tells you the details of what is returned on success (often a range of values indicate success, since partial successes may be possible) and on error (often -1, with an appropriate errno
value set to indicate why the function failed); in this case, the Ubuntu 20.04 manual page says:
Upon successful completion fopen(), fdopen(), and freopen()
return a FILE pointer. Otherwise, NULL is returned
and errno is set to indicate the error.
Thus, as the code above does, please check that fopen()
does not return NULL
before trying to use the FILE
pointer it returns.
Third, note that when the error case occurs, the program prints a message and exits with error status of 1. In Unix systems, it is traditional to return 0 upon success, and return non-zero upon failure. Here, we will use 1 to indicate failure.
Aside: if fopen() does fail, there are many possible reasons why. You can use the functions perror() and/or strerror() to print human-readable messages that describe why the error occurred; learn about those functions on your own (using … you guessed it … the man pages!), and use them in your assignment!
Once a file is opened, there are many different ways to read from it. The one we suggest that you use is fgets()
. fgets()
reads data from a file into a user-supplied buffer until it reaches a newline, the specified limit, or the end of the file. You could implement this same functionality using open()
/read()
/close()
, but we’ll be using other “f” functions later, so we might as well practice.
To print out the file’s contents, you can use printf(). For example, after reading in a line with fgets() into a variable buffer, you can just print out the buffer as follows:
printf("%s", buffer);
Note that you should not add a newline ("\n"
) character to the printf()
format string, because then your output would diverge from the file’s true contents. Just print the exact contents of the read-in buffer (which will likely include newlines).
Finally, when you are done reading and printing, use fclose()
to close the file, thus indicating you no longer need to read from it.
my-cat Spec
Your program my-cat
should be invokable with zero or more filename arguments on the command line, and my-cat
should print out the contents of each file in succession.
In all non-error cases, my-cat
should exit with status code 0 (This is usually done by returning a 0 from main()
or by calling exit(0)
).
If your program tries to fopen()
a file and fails, my-cat
should print the exact message:
my-cat: cannot open file
(followed by a newline), and exit with status code 1.
If multiple files are specified on the command line, the files should be printed out, in order, until all files are successfully printed OR an error opening a file is reached (at which point the above error message is printed and my-cat
exits). There should be nothing extra printed before, after, or in between files.
If no files are specified on the command line, my-cat
should exit and return 0 without printing anything. Note that this behavior is slightly different than the behavior of normal Unix cat
, so be careful when writing your correctness tests for the case when there are no files provided.
my-grep
The second utility you will build is called my-grep
, a variant of the Unix tool grep
. At a high level, grep
scans through a file, line by line, trying to find a user-specified search term in each line. If a line contains the search term, that line is printed out; otherwise the line is skipped. Then grep
continues searching subsequent lines.
Standard grep
is quite customizable, with many command-line options to tweak its behavior. Your my-grep
will just implement the basic functionality described above. Here is how a user would look for the term “foo
” in the file bar.txt
:
$ ./my-grep foo bar.txt
this line has foo in it
so does this foolish line; do you see where?
even this line, which has barfood in it, will be printed.
This may seem like a silly thing to implement, or even a waste of time. I assure you that it is not. I (Bill) have personally used Unix grep
as a system evaluation benchmark in multiple published papers. The grep
utility implements a linear scan through a file (or if passed the -r
flag, a recursive linear scan through a directory subtree in modified breadth-first-search order). When evaluating system performance, this is an important workload (sequential read)! In fact, extending your implementation into a “super grep”, and using your “super grep” to evaluate an existing system would be a final project that would be very exciting. Please talk to me if you are curious.
my-grep
In my-cat
, we introduced the “FILE *
” or “f” functions fopen()
and fclose()
. For my-grep
, we recommend you continue to use those functions, but also explore:
* getline(), and
* strstr()
Why these functions? Well, you can definitely implement my-grep
without using the suggested functions, and instead build upon open()
, close()
, read()
, and strcmp()
. But the above functions neatly implement a lot of helpful building blocks, so why not use them?
We suggest getline()
because, when parsing a file, lines can be arbitrarily long (that is, you may see many many characters before you encounter a newline character, “\n
”). Your my-grep
should work as expected even with very long lines.
This is why we suggest that you look into the getline()
library call (instead of fgets()
or read()
). What does getline()
help with? Well, for starters, it takes care of buffer allocation for you: it calls malloc()
and realloc()
on your behalf as lines grow, so all you need to do is free()
the buffer once you are done with it. Second, it identifies the division of lines in arbitrary text. Nice!
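To make the getline() and strstr() pairing concrete, here is a minimal sketch under stated assumptions. The helper name print_matching_lines() and its stream parameters are hypothetical, chosen for illustration only. getline() allocates and resizes the line buffer itself, and the single free() at the end releases it.

```c
#define _POSIX_C_SOURCE 200809L /* exposes getline() on POSIX systems */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: write every line of `in` that contains `term`
 * to `out`. Returns the number of matching lines, or -1 on a read error. */
long print_matching_lines(FILE *in, const char *term, FILE *out)
{
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t capacity = 0; /* current size of that buffer */
    long matches = 0;

    while (getline(&line, &capacity, in) != -1) {
        if (strstr(line, term) != NULL) { /* case-sensitive substring test */
            fputs(line, out);
            matches++;
        }
    }

    int failed = ferror(in);
    free(line); /* the one cleanup step getline() leaves to the caller */
    return failed ? -1 : matches;
}
```

With this sketch, an empty search term matches every line, because strstr() treats the empty string as present in any string.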
If you’d like, you can check out the musl libc versions of those functions, see how they work, and “roll your own” versions as an optional extension. This can be very rewarding, but if you go down that road, I encourage you to get a fully working version first before replacing those core building blocks.
my-grep Spec
If your program my-grep
is run with any arguments, it should interpret the first argument as a search term. Each successive argument will specify an individual target file to grep through (0 or more target files may be provided).
For each file, your my-grep
should go through each line of that file and check whether the search term is present in the line; if the search term is present, the entire line containing the search term should be printed, and if not, the line should be skipped.
The matching should be case sensitive. Thus, if searching for “foo
”, lines with “Foo
” do not match. This will help simplify your implementation, and it matches default grep
behavior. (See possible extensions at the end).
If my-grep
is passed zero command-line arguments, the program should print the exact message:
my-grep: searchterm [file …]
(followed by a newline), and exit with status 1.
If my-grep
encounters a file that it cannot open, it should print the exact message
my-grep: cannot open file
(followed by a newline) and exit with status 1.
In all other cases, my-grep
should exit with return code 0.
If a search term, but no file, is specified, my-grep
should work, but instead of reading from a file, my-grep
should read from standard input. Doing so is easy, because the file stream stdin
is, by default, already open and accessible by that name when you run your program; you can use fgets()
(or similar “f*” routines) to read from it. The man pages should help with this.
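One possible way to structure the choice between a named file and standard input is sketched below. The helper name open_input() and the convention of passing NULL when no file was named are assumptions for illustration, not part of the spec.

```c
#include <stdio.h>

/* Hypothetical helper: choose where my-grep reads from.
 * `path` is NULL when no file was named on the command line. */
FILE *open_input(const char *path)
{
    if (path == NULL) {
        return stdin; /* already open when the program starts */
    }
    return fopen(path, "r"); /* NULL on failure; the caller reports it */
}
```

Because both branches return a FILE *, the same line-reading code can serve either case. Only a stream that actually came from fopen() needs a matching fclose().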
For simplicity, if passed the empty string as a search string, e.g.,
$ my-grep "" file1 file2 ... fileN
my-grep
can either match NO lines or match ALL lines; both are acceptable. Document your chosen behavior in your README.md
file.
my-(un)zip
The last programs you will build come in a pair: the first program, my-zip
, is a file compression tool, and the other program, my-unzip
, is a file decompression tool.
These programs do not exactly mimic any existing Unix utilities. Thus, you will need to think of another strategy to verify your program’s correctness beyond comparing the output against an existing program. (Perhaps comparing the output against something else?)
The type of compression used in these programs is a simple form of compression called run-length encoding (RLE). RLE is quite simple: when you encounter a run of N consecutive identical characters, the compression tool will represent that run as the number N
and a single instance of the character. When decompressing, the length should be read and then that many instances of the specified character should be written.
my-zip Overview
Perhaps the easiest way to explain my-zip
behavior is through an example. Suppose we had a file with the following contents:
aaaaaaaaaabbbb
the my-zip
tool would turn it (logically) into:
10a4b
However, the exact format of the compressed file is quite important! Here, you will write out an unsigned 4-byte integer in binary format followed by the single char
. Thus, a compressed file will consist of a series of 5-byte entries, each of which consists of a 4-byte unsigned integer (the run length) and a single character.
Note: In the example above, the number “10” is displayed using two digits: the characters “1” and “0”. This is not the appropriate encoding, but it is shown for demonstration purposes. Recall our unit on numeric representations in 237; how many bits does it take to represent the integer 10? The majority of the most significant bits in the 4-byte unsigned integer representation of the number “10” will be zeros. A problem is that your text editor/shell will try to interpret those bytes as text. So be careful: You will likely see “wacky” results when viewing the output with text-based tools!
To write out an integer in binary format (not ASCII), you should use fwrite()
(again, an “f*” function, so it uses a FILE *
). Read the man page for more details. For my-zip
, all output should be written to standard output (the stdout
file stream, like stdin
, is open and accessible by default when your program starts running).
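The 5-byte entry format can be sketched as follows. This is an illustration under assumptions, not the required design: the helper names write_run() and rle_encode() are hypothetical, and the input is an in-memory buffer rather than a file, so the sketch stays focused on the fwrite() calls.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write one 5-byte entry (4-byte count, then the
 * character) to `out`. Returns 0 on success, -1 on a write error. */
static int write_run(uint32_t count, unsigned char c, FILE *out)
{
    if (fwrite(&count, sizeof count, 1, out) != 1) return -1;
    if (fwrite(&c, sizeof c, 1, out) != 1) return -1;
    return 0;
}

/* Hypothetical helper: run-length encode `len` bytes of `data` to `out`. */
int rle_encode(const unsigned char *data, size_t len, FILE *out)
{
    size_t i = 0;

    while (i < len) {
        unsigned char current = data[i];
        uint32_t count = 0;
        /* Count how many times `current` repeats, starting at position i. */
        while (i < len && data[i] == current && count < UINT32_MAX) {
            count++;
            i++;
        }
        if (write_run(count, current, out) != 0) {
            return -1;
        }
    }
    return 0;
}
```

Passing stdout as the last argument sends the entries to standard output, as the text above requires.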
Note that, because the output is not written to a persistent file, typical usage of the my-zip
tool would use shell redirection in order to divert the compressed output to a file. For example, to compress the file file.txt
into a (hopefully smaller) file called file.myzip
, you would type:
$ ./my-zip file.txt > file.myzip
Recall that, before writing any output from the command on the left side, the “greater than” sign tells Unix to either create a new file if one does not yet exist, or truncate the contents if a file with that name already exists.
In this example, the result is that the output from my-zip
is written to the file file.myzip
, which may look funky since the integers are in binary format.
my-unzip Overview
The my-unzip
tool implements the inverse behavior of the my-zip
tool: my-unzip
takes in an RLE-compressed file and writes (to standard output) the expanded results. For example, to recover the contents of file.txt
that were compressed and written to the file file.myzip
, you would type:
$ ./my-unzip file.myzip
Your my-unzip
should read in the compressed file (likely using fread()
: the complement of fwrite()
) and print out the uncompressed output to standard output using printf()
.
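A minimal sketch of the decoding loop follows. The helper name rle_decode() is hypothetical, and the sketch writes with fputc() rather than the printf() mentioned above so that the destination stream can be passed in as a parameter; passing stdout gives the behavior the text describes.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: expand 5-byte (count, character) entries read
 * from `in` onto `out`. Returns 0 on success, -1 if a count is not
 * followed by a character or an I/O error occurs. */
int rle_decode(FILE *in, FILE *out)
{
    uint32_t count;

    /* fread() returns the number of complete items read: 1, or 0 at EOF. */
    while (fread(&count, sizeof count, 1, in) == 1) {
        int c = fgetc(in);
        if (c == EOF) {
            return -1; /* a count with no character after it */
        }
        for (uint32_t i = 0; i < count; i++) {
            if (fputc(c, out) == EOF) {
                return -1;
            }
        }
    }
    return ferror(in) ? -1 : 0;
}
```

Reading the count with the same type and size that the compressor wrote is what keeps the two tools in agreement about the format.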
my-zip and my-unzip Spec
Correct invocation should pass one or more file names via the command line to the program.
If no files are specified, the program should exit with return code 1
and print the message:
my-zip: file1 [file2 …]
(followed by a newline) for my-zip
, OR
my-unzip: file1 [file2 …]
(followed by a newline) for my-unzip
.
If the program encounters a file that it cannot open, the program should exit with return code 1
and print the message
my-zip: cannot open file
(followed by a newline) for my-zip
, OR
my-unzip: cannot open file
(followed by a newline) for my-unzip
.
The format of the file compressed by my-zip
must match the description above exactly (i.e., each run is represented as a 4-byte unsigned integer followed by a character).
If multiple files are passed to my-zip
, they are compressed into a single compressed output, and when unzipped, they will turn into a single uncompressed stream of text (thus, the information that multiple files were originally input into my-zip
is lost). The same thing holds for my-unzip
. You do not need to handle runs that cross file boundaries, although you may do so if you wish.
There are two particularly useful tools that we can use to create and interpret outputs. The first is the program hexdump
, and the second is the program /usr/bin/printf
.
Let’s perform a deeper dive into the my-zip
example execution that we showed above.
Suppose we had a file named ziptest.in
with the following contents:
aaaaaaaaaabbbb
Then running
$ ./my-zip ziptest.in > ziptest.out
would direct our my-zip
output to the file ziptest.out
. Surprisingly, this file contains a mix of both “printable” and “non-printable” characters. This is confirmed by running cat
to display the contents of this file.
$ cat ziptest.out

ab$
We see a blank line, then the string ab
, then our prompt (since there is no newline after ab
).
This is likely not what you expected the output file to contain: based on the description above, I expected the number 10
as a 4-byte integer, followed by the character a
, then the number 4
as a 4-byte integer, followed by the character b
. Let’s look at the file’s contents using a flexible data interpreter called hexdump
. We’ll use this tool to diagnose the situation.
$ hexdump ziptest.out
0000000 000a 0000 0461 0000 6200
000000a
We see the “raw” contents of the file, printed in hexadecimal using the default hexdump
formatting (0000000
and 000000a
are the offsets of those lines in the file).
This output is helpful, but the format is a bit less intuitive than we’d like. If we read through the manual page, it shows how we can use a printf-style format string to control the way the data is displayed. The following command tells hexdump
to format its output in a way that matches our program’s “logic”:
$ hexdump -e '5/1 " %02X " "\n"' ziptest.out
0A 00 00 00 61
04 00 00 00 62
* The 5/1 part of the format string says to apply the pattern five times, one byte at a time (that is, to groups of five bytes).
* The " %02X " part says to print 2-digit hexadecimal values separated by spaces.
* The "\n" part says to end each group with a newline.
This output looks much better. Each line has five bytes, with each byte separated by a space. The first four bytes are our 32-bit integer count, and the last byte is our character.
You may be wondering why we don’t see the following output:
00 00 00 0A 'a'
00 00 00 04 'b'
The reason has to do with number formats. If we’re writing a 32-bit integer, we need to know how our machine represents integers. On an x86 platform, fread
and fwrite
will read/write our values in little-endian format. So the individual bytes that make up the 4-byte integer “run length” are written from the least significant byte (LSB) to the most significant byte (MSB). This is why we see 0A
(which is integer value 10) as the first byte of our file, followed by three 00
values. Here are the steps:
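* In hexadecimal, the decimal value 10 is A.
* Padding to 32 bits with leading 0s, we get 0000000A.
* Writing the bytes from least significant to most significant, we get 0A 00 00 00.
The mystery of the final byte (the character we are encoding) is a little less obvious. To figure out the meaning of the final byte, we need to consult an ASCII table. We see that a lowercase a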
, in ASCII format, is represented with the decimal value 97. This means that the hexadecimal representation of a
is 61
, which is what we see.
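The byte-order reasoning above can also be checked from C. The sketch below is only an illustration: the helper names entry_bytes() and is_little_endian() are hypothetical. It copies the in-memory bytes of one entry into an array so that they can be inspected one at a time.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the in-memory bytes of one entry, exactly as
 * fwrite() would emit them, into `bytes` (which must hold 5 bytes). */
void entry_bytes(uint32_t count, char c, unsigned char bytes[5])
{
    memcpy(bytes, &count, sizeof count); /* 4 bytes in the machine's order */
    bytes[4] = (unsigned char)c;         /* 1 byte: the character code */
}

/* Returns 1 on a little-endian machine (such as x86), 0 otherwise. */
int is_little_endian(void)
{
    uint32_t one = 1;
    unsigned char first;

    memcpy(&first, &one, 1); /* look at the lowest-addressed byte */
    return first == 1;
}
```

On a little-endian machine with an ASCII character set, entry_bytes(10, 'a', bytes) fills the array with 0A 00 00 00 61, matching the first line of the formatted hexdump output.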
So now we have one end-to-end example of an input (aaaaaaaaaabbbb
) and an output (0x0A 0x00 0x00 0x00 0x61 0x04 0x00 0x00 0x00 0x62
). And hopefully, we have a concrete understanding of what the expected output should be for a given input. The next task is figuring out how to generate files with arbitrary byte patterns to construct our tests.
The printf
program is one way to do this, and it is both simple and expressive enough that I encourage you to use it. We can use printf
to specify and write data in binary format.
$ printf '\x0a\x00\x00\x00\x61\x04\x00\x00\x00\x62' > output.bin
The prefix “\x
” lets us specify a byte in hexadecimal, and we can string together multiple bytes to create our target output.
Now we have the tools we need to create correctness tests for the my-zip
utility. (And once we’ve verified the correctness of our my-zip
, we can use the my-zip
outputs as inputs to my-unzip
. With the exception of a few cases noted in the spec, this approach seems like a good one!)
In your repository, you will find a file called Eval.md
. In it, you should assess your my-cat.c
, my-grep.c
, my-zip.c
, and my-unzip.c
implementations based on correctness (which you should verify by writing tests that compare their output against standard Unix utilities, sample outputs, and/or reference implementations), code clarity (Did you write small modular functions that you compose to complete the program’s task? Did you sufficiently document your programs so that you could understand the code if you were to revisit it a year from now? Did you choose good variable names, consistently indent, and define your variables in the appropriate scope (e.g., using return values to communicate across functions rather than updating global variables)?), proper error-handling (did you check the return value of all non-printing functions, and handle success/failure appropriately), timeliness, and the adherence to the program specifications.
Although I am an important part of your target audience, your goal should be to convince yourself that your programs are complete and high quality. This is an important life skill that extends well beyond Williams or CS. If/when you get a job, your boss will not “grade” your work and return it back to you. You will be given (or asked to write) a spec and then you must complete it. Your reputation in your company or team will be affected by the work that you contribute, and this will affect the trajectory of your career. So you should feel confident in the work you submit, and you should understand the criteria by which it will be judged.
However, this class is not just about the final product. There are often things that do not show up in your git commit history. Did you take the time to get comfortable with reading man pages? Did you spend time building your gdb
skills? Did you explore bash
syntax to help you write creative tests? Did you overcome any challenging bugs or situations that you are proud of? Doing this takes time, and these are investments that you should be rewarded for making. (Hopefully the promise of making your life easier down the line is a pay-off, but your efforts should also be acknowledged now). So at the end of your Eval.md
, you should document your experience, including the ups and downs, and reflect on how you spent your time. Convince me and convince yourself that you spent the time to learn the material.
The format of the Eval.md
that I provide is a suggestion. Ultimately, how you reflect on your Lab Assignment is up to you.
When you have completed the lab, submit your code and evaluation using the appropriate git commands, such as:
$ git status
$ git add ...
$ git commit -m "final submission"
$ git push
Verify that your changes appear on GitHub by navigating to your private repository using the web interface. It should be available at https://github.com/williams-cs/cs333lab1-{USERNAME}
. You should see all changes reflected in the various files that you submit. If not, go back and make sure you committed and pushed. I will be retrieving all lab code from GitHub, so if your changes are not visible to you on GitHub, they will not be visible to me either. I want to make sure everyone receives credit for their work!
Even before writing concrete tests, you can incrementally test your programs by hand. If you think about it, a bash
script is a series of commands that you could have executed at the command line. Consider running your programs and using shell redirection (>
) to write the output to a temporary file. If you do the same with the standard cat
and grep
programs, you can compare the outputs.
The diff
tool is incredibly helpful for comparing two files; it gives much more information in its output than cmp
(although the yes/no nature of cmp
makes it ideal for certain tests). With diff
, you will quickly see whether your behavior matches exactly, and if not, you can see how your output differs.
For my-zip
and my-unzip
, you may wish to use diff
to compare the uncompressed version of a file against the original; although that does not test every case that my-zip
and my-unzip
must handle (e.g., multiple files as input), a well organized implementation of a single-file-case test could be generalized to a test that invokes the multi-file cases.
Start early on my-cat.c
. The remaining three programs have a similar foundation, so completing my-cat.c
as early as possible will help you to create a budget/schedule for your remaining time.
Stop by my office hours, the TA help hours, and ask questions on Slack as soon as you get stuck.
Please collaborate as much as possible under the allowed guidelines. Do whatever helps you learn the material best.
You should be able to use gdb
on your code. If you are unfamiliar with gdb
, Google and your instructor/TA/classmates can help. This GDB info page from Harvey Mudd and this GDB quick reference from the Univ of Texas may also be very helpful. REMEMBER TO COMPILE YOUR CODE WITH -g
IF YOU WANT TO RUN IT IN A DEBUGGER!!
There are many interesting extensions that you could attempt if you want to add more functionality to your programs. (Note that if you attempt these extensions, it may alter the spec, which is OK!)
Many programs interpret “flags” passed in from the command line in order to affect their behavior. The cat
and grep
utilities are among them.
Adding support for “flags” is something that can be done using the getopt()
or getopt_long()
library functions. There is even more documentation elsewhere, with examples.
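As a rough sketch of how getopt() could be used for such an extension, the code below parses two single-letter flags. The struct grep_flags type, the helper name parse_flags(), and the choice of -i and -v are assumptions for illustration; none of this is required by the spec.

```c
#define _POSIX_C_SOURCE 200809L /* exposes getopt() and optind */
#include <unistd.h>

/* Hypothetical option set for an extended my-grep. */
struct grep_flags {
    int ignore_case; /* set by -i */
    int invert;      /* set by -v */
};

/* Parse leading -i / -v flags with getopt().
 * Returns the index of the first non-flag argument (the search term),
 * or -1 if an unknown flag was seen. */
int parse_flags(int argc, char *argv[], struct grep_flags *flags)
{
    int opt;

    flags->ignore_case = 0;
    flags->invert = 0;
    while ((opt = getopt(argc, argv, "iv")) != -1) {
        switch (opt) {
        case 'i': flags->ignore_case = 1; break;
        case 'v': flags->invert = 1; break;
        default:  return -1; /* getopt() already printed a diagnostic */
        }
    }
    return optind; /* getopt() leaves optind at the first operand */
}
```

After a successful call, argv[result] would be the search term and any later entries would be the target files.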
I also encourage you to look at the man pages for grep
in particular, and see if there are interesting features you’d like to try out. For instance:
* With the -i flag, grep ignores case.
* With the -v flag, grep ignores lines that match and instead prints lines that do not match.
* With the -r flag, grep recursively runs on all files under each directory. This last one is hard, but this is also the most useful feature in the bunch!
If you attempt any extensions, be sure to document them in your Eval.md
file and adjust your assessment to match your deeper engagement with the material.
The contents of this lab borrow from labs written by Remzi and Andrea Arpaci-Dusseau as part of their OSTEP course materials.