Parsing: The Problem of Finding a Derivation

Top-down Parsing

At any step in a top-down parse, one has a sentential form that one wishes to rewrite so that it more closely matches the target string. At each such step one must make two choices:
1. which of the non-terminals in the current sentential form to replace (the left-most one, to support left-to-right processing of the input).
2. which production to apply to the selected non-terminal.
To eliminate the evil influence of intuition, let's consider finding a derivation given a not very meaningful grammar:
< S > → a < R > | b < S > b < R >
< R > → b < R > | a

Consider the process of using a top-down parser to find a derivation for 'bbaababa' relative to the grammar given above.

Matched terminals	Tail of sentential form	Pending input
	< S >	bbaababa
b	< S > b < R >	baababa
bb	< S > b < R > b < R >	aababa
bba	< R > b < R > b < R >	ababa
bbaab	< R > b < R >	aba
bbaabab	< R >	a
bbaababa

Note that:
1. If we concatenate "Matched" and "Tail of Sentential Form" we always obtain a complete sentential form of the grammar,
2. the "Tail of Sentential Form" column behaves like a stack.
A top down parser can be implemented by explicitly maintaining this stack as a data structure.
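As a rough illustration of the explicit-stack idea, here is a minimal Python sketch for the example grammar above. The names `TABLE`, `NONTERMINALS` and `parse` are made up for this sketch and are not part of the original notes. It relies on the fact that, in this particular grammar, the alternatives for each non-terminal begin with different terminals, so the next input character is enough to pick a production.

```python
# Productions of the example grammar, indexed by
# (non-terminal on top of the stack, next input character).
# TABLE, NONTERMINALS and parse are hypothetical names for this sketch.
TABLE = {
    ("S", "a"): ["a", "R"],
    ("S", "b"): ["b", "S", "b", "R"],
    ("R", "b"): ["b", "R"],
    ("R", "a"): ["a"],
}
NONTERMINALS = {"S", "R"}


def parse(text, start="S"):
    """Return True if text can be derived from start, False otherwise."""
    stack = [start]  # the "tail of sentential form"; top is the end of the list
    pos = 0          # everything before pos is the "matched" column
    while stack:
        top = stack.pop()
        ch = text[pos] if pos < len(text) else None
        if top in NONTERMINALS:
            rhs = TABLE.get((top, ch))
            if rhs is None:
                return False             # no production fits the next character
            stack.extend(reversed(rhs))  # push so the left-most symbol is on top
        elif top == ch:
            pos += 1                     # terminal on the stack matches the input
        else:
            return False                 # terminal mismatch or input exhausted
    return pos == len(text)              # succeed only if all input was consumed
```

Calling `parse("bbaababa")` walks through the same stack contents as the table above, with extra intermediate steps for each individual terminal match.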
To make our parse "deterministic", we want to decide how to expand the first non-terminal on the stack based only on what we have matched so far and on some finite prefix of the remaining input. If this is possible using a prefix of length k, we say that the grammar is LL(k).
In many cases, this is not possible for any k.
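- Consider the productions:
  
  < stmt > → if < expr > then < stmt > end
  
  | if < expr > then < stmt > else < stmt > end
- Suppose that we have generated a sentential form in which the left-most non-terminal is < stmt > and the next input symbol to be read is "if". Which production should we choose? Both alternatives begin the same way, and the < expr > and < stmt > that precede the distinguishing "else" or "end" can be arbitrarily long, so no fixed amount of lookahead suffices.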
For most languages, however, we can find a grammar in which one can determine which production to use next by just looking at the first unmatched character. Such a grammar is called an LL(1) grammar.
The following grammar:

< stmt > → if < expr > then < stmt > < iftail >

< iftail > → else < stmt > end

| end

is obviously LL(1) because:
1. The right hand side of each production begins with a terminal, and
2. if two productions have the same left hand side, then their right hand sides begin with different terminal symbols.
A grammar with these two properties is said to be an S-grammar. Any S-grammar is LL(1).
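The two properties can be checked mechanically. The following Python sketch is only an illustration: the function `is_s_grammar` and the dictionary encoding of grammars are assumptions made here, not something from the notes. It also looks only at the productions listed, so the productions for < expr >, which the notes do not give, are left out.

```python
def is_s_grammar(grammar, nonterminals):
    """Check the two S-grammar properties.

    grammar maps each non-terminal to a list of right-hand sides,
    each right-hand side being a list of symbols (hypothetical encoding).
    """
    for lhs, alternatives in grammar.items():
        first_symbols = set()
        for rhs in alternatives:
            # Property 1: every right-hand side begins with a terminal.
            if not rhs or rhs[0] in nonterminals:
                return False
            # Property 2: alternatives for the same left-hand side
            # begin with different terminals.
            if rhs[0] in first_symbols:
                return False
            first_symbols.add(rhs[0])
    return True


NONTERMINALS = {"stmt", "expr", "iftail"}

# The original pair of productions: both alternatives start with "if".
unfactored = {
    "stmt": [
        ["if", "expr", "then", "stmt", "end"],
        ["if", "expr", "then", "stmt", "else", "stmt", "end"],
    ],
}

# The rewritten grammar using <iftail>.
factored = {
    "stmt": [["if", "expr", "then", "stmt", "iftail"]],
    "iftail": [["else", "stmt", "end"], ["end"]],
}
```

With these definitions, `is_s_grammar(unfactored, NONTERMINALS)` fails on property 2, while `is_s_grammar(factored, NONTERMINALS)` passes both checks.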
In the case of top-down parsing, this assumption of a deterministic parse produced using a single scan through the input leads to the production of left-most derivations.
- If at some point we have derived the sentential form xAα (where x is a string of terminals, A is a non-terminal and α is composed of terminals and non-terminals) while trying to parse the target string, we would want to read in at least the prefix x of the target string before proceeding further.
- If we expand A at this point, any prefix of terminals included in the right-hand side we substitute for A will need to be checked against the next input characters following x.
- If we instead expand some non-terminal in α, we will either need to read past all the terminals that will eventually be matched by A (saving them so that we can check that they match later) or have to remember to check the correctness of the substitution made for A later.

One of the attractions of top down parsing is that there is a simple scheme for implementing a top down parser in any language that supports recursion. The following procedure skeletons show how such a "recursive descent" parser for the S-grammar:

< S > → a < R > | b < S > b < R >
< R > → b < R > | a

would look (it assumes that "ch" holds the next input character to be processed):


procedure R;
    if ch = 'b' then            (* R -> b R *)
         getnextchar;
         R;
    else if ch = 'a' then       (* R -> a *)
         getnextchar;
    else
         error;
    end
end R;

procedure S;
    if ch = 'a' then            (* S -> a R *)
         getnextchar;
         R;
    else if ch = 'b' then       (* S -> b S b R *)
         getnextchar;
         S;
         if ch = 'b' then
             getnextchar;
         else
             error;
         end;
         R;
    else
         error;
    end
end S;
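For readers who want to run the skeletons, here is one possible Python transcription. It is a sketch under a few assumptions: the class `Parser`, the exception `ParseError` and the helper `accepts` are invented for this example, "ch" is modeled as a property over a string, and the end of the input is represented by `None`. It also reports an error when S sees neither 'a' nor 'b'.

```python
class ParseError(Exception):
    """Raised where the skeletons call 'error' (hypothetical name)."""


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    @property
    def ch(self):
        # The next input character to be processed, or None at end of input.
        return self.text[self.pos] if self.pos < len(self.text) else None

    def getnextchar(self):
        self.pos += 1

    def error(self):
        raise ParseError(f"unexpected {self.ch!r} at position {self.pos}")

    def R(self):
        if self.ch == "b":      # R -> b R
            self.getnextchar()
            self.R()
        elif self.ch == "a":    # R -> a
            self.getnextchar()
        else:
            self.error()

    def S(self):
        if self.ch == "a":      # S -> a R
            self.getnextchar()
            self.R()
        elif self.ch == "b":    # S -> b S b R
            self.getnextchar()
            self.S()
            if self.ch == "b":
                self.getnextchar()
            else:
                self.error()
            self.R()
        else:
            self.error()


def accepts(text):
    """Return True if the whole of text is derivable from S."""
    parser = Parser(text)
    try:
        parser.S()
    except ParseError:
        return False
    return parser.ch is None    # reject leftover input after a complete S
```

For example, `accepts("bbaababa")` succeeds, while `accepts("ab")` fails because R runs out of input after matching 'b'.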

One of the nice things about recursive descent parsing is that you can "massage" the code instead of the grammar.
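As one guess at what such massaging might look like (the notes do not say which transformations are meant), the recursive call in R can be replaced by a loop, since < R > → b < R > | a derives exactly zero or more 'b' characters followed by a single 'a'. The function name `match_R` and its position-based interface are assumptions for this sketch.

```python
def match_R(text, pos):
    """Return the position just past an <R> starting at pos, or None on failure."""
    # Each pass through the loop stands for one use of R -> b R.
    while pos < len(text) and text[pos] == "b":
        pos += 1
    # The final step stands for R -> a.
    if pos < len(text) and text[pos] == "a":
        return pos + 1
    return None
```

The grammar itself is unchanged; only the code that recognizes it has been rearranged.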

Computer Science 434
Department of Computer Science
Williams College