Woolite Compiler Implementation Project
Phase 3: Collecting Garbage
Due: really soon
The primary component of the garbage collection system will be the garbage collector itself. Recall, however, that this will also involve some changes to earlier components of your code. The garbage collector will depend on the use of an encoding scheme that enables it to distinguish pointers to objects from integers and other values. As we have discussed in class, this will involve changing the way your compiler generates code for arithmetic expressions so that the code will actually produce final and intermediate values that are twice as large as the expected values. You may also need to be very careful about what values are left in registers or on the stack at points where garbage collections might occur.
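The effect of doubling on arithmetic can be checked on paper or with a few lines of ordinary C. The sketch below is only an illustration written for gcc, not for c--. The function names are made up for this example, and it assumes (the handout does not say this) that object pointers are tagged by making them odd, so that every even word is an integer.

```c
#include <assert.h>

/* Encode a source-level integer as the doubled value held at run time. */
int encodeInt(int n) { return n * 2; }

/* Recover the source-level integer from its doubled representation. */
int decodeInt(int v) {
    assert(v % 2 == 0);          /* doubled integers are always even */
    return v / 2;
}

/* Addition needs no adjustment: 2a + 2b == 2(a + b). */
int addEncoded(int a, int b) { return a + b; }

/* Multiplication must halve one operand first: (2a / 2) * 2b == 2(a * b). */
int mulEncoded(int a, int b) { return (a / 2) * b; }

/* ASSUMPTION: pointers are odd, so the low bit separates the two kinds. */
int looksLikePointer(int v) { return (v & 1) != 0; }
```

The point of the example is that addition and subtraction work unchanged on doubled values, while multiplication and division need one extra halving or doubling step to keep the result in doubled form.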
You will write the actual garbage collector using a compiler for a dialect of C that I wrote for the 34000 several years ago. The compiler is named c--. It can be found in ~tom/shared/434/woolite/bin. As its name suggests, the language accepted by this compiler falls somewhat short of compliance with the ANSI C standard. The differences between the language recognized by c-- and the standard C language are discussed below.
To compile a program using c--, simply type a command of the form
~tom/shared/434/woolite/bin/c-- source-file-name
For example, to compile garbageman.c and save the generated assembly code in garbageman.s, type
~tom/shared/434/woolite/bin/c-- garbageman.c > garbageman.s
Of course, to make the code produced by the c-- compiler work with the code produced by your Woolite compiler, you have to run both files, together with the iolib.s file, through the assembler. This could be done using #includes, but the assembler is actually capable of processing multiple source files. To help you take advantage of this ability, I have included a script named "wcc" in the ~tom/shared/434/woolite/phase3 directory. You should copy this file to your directory (you should also change your Makefile to say "phase3" and do a make depend).
The line within the wcc script that actually runs the assembler looks like:
~tom/shared/434/bin/wc34asm -l $1:r.s garbageman.s IOlib.s > > $1:r.l
The text $1:r.s tells it to include the output of running your Woolite compiler.
The names garbageman.s and IOlib.s are supposed to refer to the files containing
the code c-- produced when asked to compile your garbage collector code and
the code for the IO library (Don't forget that you will have to change the IO library
to deal with the fact that all integer values have been multiplied by 2).
You can edit the script to change either of these names
if necessary. Running this script should produce a tmem file containing all of the
code needed to run a Woolite program. Also, the script is designed to leave a copy of
the listing file produced by the assembler in a file formed by replacing the .s suffix
on the name of the file produced by your Woolite compiler with a .l suffix.
The weakest component of c-- is its ability to handle syntactic and semantic errors in the source program. The messages it produces in response to syntax errors could easily be more informative. The amount of time you spend enduring this weakness of my compiler can be greatly reduced by first compiling code using the standard C compiler. While gcc won't produce code for the WC34000 machine, it will produce nice error messages that will help you clean your code up before submitting it to c--.
I should also admit that the compiler has other quirks. In past semesters, students have adjusted to them quickly. Please report any odd behavior to me, however. I'll try to fix any problems as quickly as I can.
The most significant differences between C and the language accepted by c-- involve the overall structure of the declarations and definitions that comprise a program. First, c-- does not support separate compilation. Second, c--'s scope rules for external variables allow forward references. In other words, the scope of an external declaration is the entire program, not just those parts of the source code that appear after the declaration. Finally, since the c-- compiler was written slightly before the dawn of time (i.e. 1987), it requires the use of a syntax for function headers that was revised with the adoption of ANSI C.
The first two unusual features of c-- make the distinction between declarations and definitions in C unnecessary. There is no need to declare functions before they are defined since forward references are allowed. Thus, while in C it is common to write a function declaration such as:
double pop();
c-- does not support such function declarations. In a c-- source file every function header must be immediately followed by the appropriate function body. Similarly, multiple declarations of external variables are not allowed.
The old style of function declarations that c-- uses requires that the types of function parameters be included in separate, variable-like declarations after a function's header line rather than with the parameter names in the header. Thus, a function that would be declared in ANSI C as
int searchforchar( char *source, char key, int occurrence) { ... }
must be written for c-- as
int searchforchar( source, key, occurrence) char *source; char key; int occurrence; { ... }
There are several other less significant differences between C declarations and the declarations that c-- will accept. None of the "storage class specifiers" that may be included in C declarations are supported by c--. These include auto, static, extern, register and typedef. The only one of these modifiers you will really miss is typedef. Another limitation is that c-- does not support initializers in declarations. Finally, while c-- does support all the structured types of C including arrays, structures and unions, bit-field components are not allowed within structures.
The set of scalar types supported by c-- is simpler than that normally included in C. Only two scalar types are recognized: int and char. Values of both types are stored in 16-bit words (the only unit of storage provided by the WC34000).
c-- supports all of the usual types of expressions found in C, including all of the usual unary and binary operators, subscripting, component selection, pointer dereferencing, type casts and the sizeof pseudo-function. With the exception of the switch statement and the goto statement, all of the usual C statement types are also recognized.
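To give a feel for code written within these restrictions, here is a small sketch. The struct and function are invented purely for illustration and have nothing to do with the collector itself. It avoids typedef, initializers and switch. The function header is written in the ANSI style so that the fragment compiles with gcc. For c-- the header would have to be rewritten in the old style, as in "int describe(c) struct cell *c; { ... }".

```c
#include <assert.h>

/* No typedef: refer to the struct by its tag everywhere. */
struct cell {
    int tag;
    int value;
};

int describe(struct cell *c) {
    /* No initializers: declare first, assign afterwards. */
    int result;

    assert(c != 0);
    result = 0;

    /* No switch: use an if/else chain instead. */
    if (c->tag == 0) {
        result = c->value;
    } else if (c->tag == 1) {
        result = -c->value;
    } else {
        result = 0;
    }
    return result;
}
```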
Within the assembly code it produces, the compiler includes STAB directives that provide information to the debugger about variables defined in the procedure. A complete discussion of these STAB directives can be found in the handout "The WC34000 Assembler". For our purposes, however, all you need to know is that this means you will be able to use the source level debugging features of the 34000 when it is executing code within your garbage collector.
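You will need to store a few values describing the state of the heap in memory in such a way that you can access them easily from either the code your Woolite compiler generates or from the C code of your garbage collector. The values you will need include the address of the next free word, the address of the end of the heap region currently in use, the address of the start of the heap, and the size of the heap.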
The code you generate for "new" operations in the final version of your compiler will first check to see if the space required for the new object exceeds the free space remaining in the half of the heap that is currently in use. If there is not enough space, the code you generate should invoke the garbage collector, which will simply be a function written in C that has been compiled separately using c--.
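Your compiler will emit this check as WC34000 instructions, but the logic may be easier to see written out in C. The sketch below is only a model of it for gcc. The function allocateWords and the counter collections are hypothetical, heapSize is declared as a plain int for this example, and the collector shown is a stand-in that frees nothing.

```c
#include <assert.h>

struct heapDesc {
    int *nextFree;        /* first unused word in the region in use */
    int *heapRegionEnd;   /* one past the last word of that region */
    int *heapStart;
    int heapSize;
};

int collections;          /* hypothetical counter, used only by this sketch */

/* Stand-in for the real collector: it only records that it was called. */
void collector(struct heapDesc *heapInfo) {
    assert(heapInfo != 0);
    collections = collections + 1;
}

/* Model of the code generated for "new": returns the address of the new
   object, or 0 if there is no room even after a collection. */
int *allocateWords(struct heapDesc *heapInfo, int words) {
    int *object;

    assert(words > 0);
    if (heapInfo->heapRegionEnd - heapInfo->nextFree < words) {
        collector(heapInfo);
        if (heapInfo->heapRegionEnd - heapInfo->nextFree < words) {
            return 0;     /* still not enough space */
        }
    }
    object = heapInfo->nextFree;
    heapInfo->nextFree = heapInfo->nextFree + words;
    return object;
}
```

Checking the free space a second time after the collector returns is a design choice of this sketch. What your generated code should do when a collection does not free enough space is up to you.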
Generating code to invoke a method in the code produced by c-- is actually quite simple (if ad hoc). After you run c--, you can look at the generated code and see the label that the compiler placed on the first line of your main garbage collection function. Then modify your Woolite compiler to generate a JSR to this label when it needs to invoke the garbage collector.
When the garbage collector runs, it will need to access the four integers that describe the state of the heap. There is no (reasonable) way to force c-- to associate a C-level variable name with a particular label on a DC or DS. There is, however, a simple way to arrange to access the four values that describe the state of the heap within your C code.
As long as you allocate the four words that hold the values that represent the state of the heap together, you can define a C struct type that describes them. This struct type might look like:
struct heapDesc { int * nextFree; int * heapRegionEnd; int * heapStart; int heapSize; };
If you have defined such a structure, you can then define your garbage collector function so that it accepts the address of the first word of this region as a parameter and treats it as a pointer to a struct of this type. For example, the function header for your collector might look like:
void collector( heapInfo ) struct heapDesc * heapInfo; {
Then, within your code you can write statements that access these values like:
heapInfo->heapRegionEnd = heapInfo->heapStart + heapInfo->heapSize/2;
Your garbage collector must scan the stack, examining all active stack frames for pointers to objects, as part of the garbage collection process. To do this, it will need to determine the addresses of the first and last words of the stack. The address of the bottom of the stack is fixed. You can either type this value in as a constant or have your initialization code move the value of A7 to a known location (possibly a fifth word of the "heapDesc" struct). The address of the top of the stack varies depending on how many methods are active when garbage collection occurs. Luckily, there is a simple trick you can use in your C code to determine this value. When it calls your garbage collector, the code generated by your Woolite compiler will pass it a pointer to the free pointer as a parameter. The address of the word that holds this parameter value will be one larger than the address of the topmost word on the stack that your garbage collector needs to scan. Accordingly, if you declare the "collector" method as suggested above, then the expression
&heapInfo + 1
will return the address of the top of the stack.
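Once the addresses of the first and last stack words are known, the scan itself is a simple loop. The sketch below models it in ordinary C for gcc, using an int array in place of the real stack. The function names are invented for this example. It assumes that a word is a pointer exactly when it is odd, and that the top of the stack lies at a lower address than the bottom. Neither assumption comes from this handout, so adjust both to match your encoding and the machine.

```c
#include <assert.h>

/* ASSUMPTION: pointers are odd and doubled integers are even. */
int looksLikePointer(int word) { return (word & 1) != 0; }

/* Count the words between top and bottom (inclusive) that look like
   pointers. A real collector would copy the object each one refers to
   instead of just counting. */
int countStackPointers(int *top, int *bottom) {
    int *p;
    int found;

    assert(top <= bottom);   /* ASSUMPTION: top is at the lower address */
    found = 0;
    p = top;
    while (p <= bottom) {
        if (looksLikePointer(*p)) {
            found = found + 1;
        }
        p = p + 1;
    }
    return found;
}
```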
The initialization code generated by your Woolite compiler must initialize the values stored in the four words that describe the state of the heap. You can either generate the code to do this directly, or include an initialization function in your garbage collector and generate code to invoke this method. In either case, you will probably want to use the fact that the assembler leaves the address of the start of the area after your program's code in word 1 of memory and either simply move this value to the "heapStart" word or pass it as a parameter to the initialization routine.
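If you choose the second option, the initialization function can be very short. The sketch below is one possible shape for it, written for gcc with an ANSI-style header (c-- would need the old-style header). The name initHeap and its parameters are hypothetical, and heapSize is declared here as a plain int so that the division by 2 is ordinary integer arithmetic.

```c
#include <assert.h>

struct heapDesc {
    int *nextFree;
    int *heapRegionEnd;
    int *heapStart;
    int heapSize;
};

/* Fill in the four heap-state words. "start" is the first word after the
   program's code and "size" is the total heap size in words. */
void initHeap(struct heapDesc *heapInfo, int *start, int size) {
    assert(size > 0 && size % 2 == 0);   /* two equal halves */
    heapInfo->heapStart = start;
    heapInfo->heapSize = size;
    heapInfo->nextFree = start;          /* allocation begins in the first half */
    heapInfo->heapRegionEnd = heapInfo->heapStart + heapInfo->heapSize / 2;
}
```

In this sketch, calling initHeap with a 100-word heap leaves nextFree at the start of the heap and heapRegionEnd 50 words past it.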
To debug your garbage collector, I suggest you start by setting the heap size to a very small value (100 words or less), so that the first collection happens quickly and doesn't have too much work to do. Only after this is working should you try to use larger heap sizes.