Woolite Compiler Implementation Project
Phase 3: Collecting Garbage
Due: really soon

The final step in completing your compiler will be to implement a garbage collection system that will make it possible to run programs that require more memory for temporary results than is available in the 34000 simulator's address space.

The primary component of the garbage collection system will be the garbage collector itself. Recall, however, that this will also involve some changes to earlier components of your code. The garbage collector will depend on the use of an encoding scheme that enables it to distinguish pointers to objects from integers and other values. As we have discussed in class, this will involve changing the way your compiler generates code for arithmetic expressions so that the code will actually produce final and intermediate values that are twice as large as the expected values. You may also need to be very careful about what values are left in registers or on the stack at points where garbage collections might occur.
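
As a rough illustration of why doubled values change the code generated for arithmetic, here is a small sketch in standard C. It assumes a hypothetical convention in which integers are stored doubled (so their low bit is 0) and object pointers carry a 1 in the low bit; the scheme discussed in class may differ in its details. The function names are invented for this example.

```c
/* Hypothetical tagging scheme (an assumption, not the required one):
   integers are stored doubled, so their low bit is always 0;
   object pointers are assumed to have a 1 in the low bit. */

int tagInt(int n)      { return n * 2; }          /* true value -> stored value */
int untagInt(int v)    { return v / 2; }          /* stored value -> true value */
int isTaggedInt(int v) { return (v & 1) == 0; }   /* even words are integers */

/* The sum of two doubled values is already the doubled sum. */
int taggedAdd(int a, int b) { return a + b; }

/* The product of two doubled values would be four times the true
   product, so one operand is halved before multiplying. */
int taggedMul(int a, int b) { return (a / 2) * b; }
```

Addition and subtraction need no adjustment under this convention, while multiplication and division need one extra halving or doubling step. The IO library routines would need a matching adjustment when they read or print integers.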

Using the c-- compiler

You will write the actual garbage collector using a compiler for a dialect of C that I wrote for the 34000 several years ago. The compiler is named c--. It can be found in ~tom/shared/434/woolite/bin. As its name suggests, the language accepted by this compiler falls somewhat short of compliance with the ANSI C standard. The differences between the language recognized by c-- and the standard C language are discussed below.

To compile a program using c-- simply type a command of the form

~tom/shared/434/woolite/bin/c-- source-file-name

The source file name should end with a ".c" suffix. The compiler will write error messages (if any) to the standard error file (usually your screen) and send the 34000 assembly language code corresponding to the C-- in its input file to the standard output. As a result, what you will probably end up doing is frequently typing a command that looks like

~tom/shared/434/woolite/bin/c-- garbageman.c > garbageman.s

Of course, to make the code produced by the c-- compiler work together with the code produced by your Woolite compiler, you have to run both files through the assembler together with the IOlib.s file. This could be done using #includes, but the assembler is actually capable of processing multiple source files. To help you take advantage of this ability, I have included a script named "wcc" in the ~tom/shared/434/woolite/phase3 directory. You should copy this file to your directory (you should also change your Makefile to say "phase3" and do a make depend).

The line within the wcc script that actually runs the assembler looks like:

~tom/shared/434/bin/wc34asm -l $1:r.s garbageman.s IOlib.s > $1:r.l

The text $1:r.s tells the assembler to include the output of running your Woolite compiler. The names garbageman.s and IOlib.s are supposed to refer to the files containing the code c-- produced when asked to compile your garbage collector code and the code for the IO library. (Don't forget that you will have to change the IO library to deal with the fact that all integer values have been multiplied by 2.) You can edit the script to change either of these names if necessary. Running this script should produce a tmem file containing all of the code needed to run a Woolite program. Also, the script is designed to leave a copy of the listing file produced by the assembler in a file whose name is formed by replacing the .s suffix on the name of the file produced by your Woolite compiler with a .l suffix.

The weakest component of c-- is its ability to handle syntactic and semantic errors in the source program. The messages it produces in response to syntax errors could easily be more informative. The amount of time you spend enduring this weakness of my compiler can be greatly reduced by first compiling code using the standard C compiler. While gcc won't produce code for the WC34000 machine, it will produce nice error messages that will help you clean your code up before submitting it to c--.

I should also admit that the compiler has other quirks. In past semesters, students have adjusted to them quickly. Please report any odd behavior to me, however. I'll try to fix any problems as quickly as I can.

The language accepted by c--

The most significant differences between C and the language accepted by c-- involve the overall structure of the declarations and definitions that comprise a program. First, c-- does not support separate compilation. Second, c--'s scope rules for external variables allow forward references. In other words, the scope of an external declaration is the entire program, not just those parts of the source code that appear after the declaration. Finally, since the c-- compiler was written slightly before the dawn of time (i.e. 1987), it requires the use of a syntax for function headers that was revised with the adoption of ANSI C.

The first two unusual features of c-- make the distinction between declarations and definitions in C unnecessary. There is no need to declare functions before they are defined since forward references are allowed. Thus, while in C it is common to place a function declaration such as:

double pop();

at the beginning of a source file (or in a .h file) so that the function described can be referenced before it is defined, c-- does not support such function declarations. In a c-- source file every function header must be immediately followed by the appropriate function body. Similarly, multiple declarations of external variables are not allowed.

The old style of function declarations that c-- uses requires that the types of function parameters be included in separate, variable-like declarations after a function's header line rather than with the parameter names in the header. Thus, a function that would be declared as

int searchforchar( char *source, char key, int occurrence)
{
...
}

using the current C syntax must be declared as

int searchforchar( source, key, occurrence)
char *source;
char key;
int occurrence;
{
...
}

when using c--.

There are several other less significant differences between C declarations and the declarations that c-- will accept. None of the "storage class specifiers" that may be included in C declarations are supported by c--. These include auto, static, extern, register and typedef. The only one of these modifiers you will really miss is typedef. Another limitation is that c-- does not support initializers in declarations. Finally, while c-- does support all the structured types of C including arrays, structures and unions, bit-field components are not allowed within structures.

The set of scalar types supported by c-- is simpler than that normally included in C. Only two scalar types are recognized: int and char. Values of both types are stored in 16 bit words (the only unit of storage provided by the WC34000).

c-- supports all of the usual types of expressions found in C, including all of the usual unary and binary operators, subscripting, component selection, pointer dereferencing, type casts and the sizeof pseudo-function. With the exception of the switch statement and the goto statement, all of the usual C statement types are also recognized.

Within the assembly code it produces, the compiler includes STAB directives that provide information to the debugger about variables defined in the procedure. A complete discussion of these STAB directives can be found in the handout "The WC34000 Assembler". For our purposes, however, all you need to know is that this means you will be able to use the source level debugging features of the 34000 when it is executing code within your garbage collector.

Communicating with the Garbage Collector

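You will need to store a few values describing the state of the heap in memory in such a way that you can access them easily from either the code your Woolite compiler generates or from the C code of your garbage collector. The values you will need include:

    the address of the next free word in the half of the heap currently in use,
    the address of the end of the half of the heap currently in use,
    the address of the start of the heap, and
    the size of the heap.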

You will need to access the first two of these values in the code your compiler generates for each "new" operation. Therefore, you will probably want to allocate them using DC and DS directives and associate them with code labels.

The code you generate for "new" operations in the final version of your compiler will first check to see if the space required for the new object exceeds the free space remaining in the half of the heap that is currently in use. If there is not enough space, the code you generate should invoke the garbage collector, which will simply be a function written in C that has been compiled separately using c--.
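
The check described above can be modeled in standard C as follows. This is only a sketch of the logic your generated assembly code would follow, not code to hand to c--. The names "allocate" and "collectStub" are invented for this example, heapSize is taken here to be a plain int word count, and the stub collector simply pretends that every object was garbage rather than copying live objects to the other half of the heap.

```c
struct heapDesc {
    int *nextFree;       /* next unused word in the active half */
    int *heapRegionEnd;  /* first word past the active half */
    int *heapStart;      /* first word of the whole heap */
    int heapSize;        /* total words in both halves (assumed int) */
};

static int collections = 0;  /* counts collector calls, for illustration only */

/* Stand-in for the real collector: treats everything as garbage and
   resets the free pointer to the start of the active half. */
void collectStub(struct heapDesc *h)
{
    collections++;
    h->nextFree = h->heapRegionEnd - h->heapSize / 2;
}

/* Returns the address of a new object of `words` words, or 0 if the
   object does not fit even after a collection. */
int *allocate(struct heapDesc *h, int words)
{
    int *obj;

    if (words > h->heapRegionEnd - h->nextFree) {
        collectStub(h);                      /* not enough room: collect */
        if (words > h->heapRegionEnd - h->nextFree)
            return 0;                        /* still not enough room */
    }
    obj = h->nextFree;                       /* bump allocation */
    h->nextFree += words;
    return obj;
}
```

Note that only nextFree and heapRegionEnd are consulted on the fast path, which is why those two words are the ones your generated code needs to reach by label.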

Generating code to invoke a method in the code produced by c-- is actually quite simple (if ad hoc). After you run c--, you can look at the generated code and see the label that the compiler placed on the first line of your main garbage collection function. Then modify your Woolite compiler to generate a JSR to this label when it needs to invoke the garbage collector.

When the garbage collector runs, it will need to access the four integers that describe the state of the heap. There is no (reasonable) way to force c-- to associate a C-level variable name with a particular label on a DC or DS. There is, however, a simple way to arrange to access the four values that describe the state of the heap within your C code.

As long as you allocate the four words that hold the values that represent the state of the heap together, you can define a C struct type that describes them. This struct type might look like:

struct heapDesc {
    int * nextFree;
    int * heapRegionEnd;
    int * heapStart;
    int heapSize;
    };

If you have defined such a structure, you can then define your garbage collector function so that it accepts the address of the first word of this region as a parameter and treats it as a pointer to a struct of this type. For example, the function header for your collector might look like:

void collector( heapInfo )
struct heapDesc * heapInfo;
{

Then, within your code you can write statements that access these values like:

   heapInfo->heapRegionEnd = heapInfo->heapStart + heapInfo->heapSize/2;

Your garbage collector must scan the stack, examining all active stack frames for pointers to objects, as part of the garbage collection process. To do this, it will need to determine the addresses of the first and last words of the stack. The address of the bottom of the stack is fixed. You can either type this value in as a constant or have your initialization code move the value of A7 to a known location (possibly a fifth word of the "heapDesc" struct). The address of the top of the stack varies depending on how many methods are active when garbage collection occurs. Luckily, there is a simple trick you can use in your C code to determine this value. When it calls your garbage collector, the code generated by your Woolite compiler will pass it a pointer to the free pointer as a parameter. The address of the word that holds this parameter value will be one smaller than the address of the topmost word on the stack that your garbage collector needs to scan. Accordingly, if you declare the "collector" method as suggested above, then the expression

&heapInfo + 1

will return the address of the top of the stack.
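
Once both ends of the stack are known, the scan itself is a simple loop. The sketch below, in standard C, takes the two addresses as hypothetical parameters "top" and "bottom" and assumes that the stack grows toward lower addresses, so that top is never greater than bottom. It also assumes a tagging convention in which a word with its low bit set is an object pointer; substitute whatever test your own encoding requires. A real collector would copy each object it finds and update the stack word, where this sketch merely counts.

```c
/* Assumed tagging: a word whose low bit is 1 is treated as a pointer. */
int looksLikePointer(int word)
{
    return (word & 1) != 0;
}

/* Examines every word from `top` (lowest address) up to and including
   `bottom`, and returns how many of them look like object pointers. */
int countStackPointers(int *top, int *bottom)
{
    int *p;
    int found = 0;

    for (p = top; p <= bottom; p++) {
        if (looksLikePointer(*p))
            found++;   /* a real collector would forward the object here */
    }
    return found;
}
```

Because the loop treats every word on the stack the same way, any untagged value left on the stack (a return address, for example) must not be mistaken for a pointer. This is one reason to be careful about what your generated code leaves on the stack at points where a collection might occur.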

The initialization code generated by your Woolite compiler must initialize the values stored in the four words that describe the state of the heap. You can either generate the code to do this directly, or include an initialization function in your garbage collector and generate code to invoke this method. In either case, you will probably want to use the fact that the assembler leaves the address of the start of the area after your program's code in word 1 of memory and either simply move this value to the "heapStart" word or pass it as a parameter to the initialization routine.
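
If you choose to write an initialization function, it might look something like the following sketch, written in standard C so that it can be checked with gcc (c-- would require the old-style parameter declarations). The name "initHeap" and its parameters are invented for this example, and heapSize is assumed to be a plain int holding the total number of words in both halves of the heap.

```c
struct heapDesc {
    int *nextFree;       /* next unused word in the active half */
    int *heapRegionEnd;  /* first word past the active half */
    int *heapStart;      /* first word of the whole heap */
    int heapSize;        /* total words in both halves (assumed int) */
};

/* start: address of the first word after the program's code
   size:  number of words to use for the entire heap */
void initHeap(struct heapDesc *h, int *start, int size)
{
    h->heapStart = start;
    h->heapSize = size;
    h->nextFree = start;                    /* nothing allocated yet */
    h->heapRegionEnd = start + size / 2;    /* begin in the first half */
}
```

Your generated initialization code would pass the value found in word 1 of memory as the start address, along with whatever heap size you have chosen.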

To debug your garbage collector, I suggest you start by setting the heap size to a very small value (100 words or less), so that the first collection happens quickly and doesn't have too much work to do. Only after this is working should you try to use larger heap sizes.


Computer Science 434
Department of Computer Science
Williams College