Let's use what you learned about regular languages and finite state
machines to figure out how a program like Lex works. (We will save figuring out
Yacc for after break.)
Lex starts with a collection of regular expressions.
With a bit of luck, you
remember that the set of languages described by regular expressions is identical
to the set of languages that can be recognized by finite state automata (and the
set described by regular grammars).
With this in mind, a reasonable strategy for
building a program that recognizes strings that match regular expressions is to
build the FSA that corresponds to the regular expressions in a Lex input file.
First, recognize that there are a number of things we can do to reduce the apparent
complexity of the language of regular expressions before we even begin.
For example, Lex allows expressions of the form $\alpha$+. Any such expression
can be rewritten as $\alpha$($\alpha$*). Accordingly, if we figure out how
to handle concatenation and *, we don't have to worry about +. We can just rewrite
any expression that uses + before we start trying to build an FSA.
Through similar tricks we can get the set of regular expressions we have to worry about
down to single symbols from our alphabet, concatenation, alternation ( | ), and
the Kleene star (*).
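The rewriting step can be sketched as a small recursive pass over the expression. This is only an illustration, not how Lex itself is implemented: the nested-tuple representation and the names `desugar`, "sym", "cat", "alt", "star" and "plus" are all assumptions made for the example.

```python
# Regular expressions as nested tuples (a hypothetical representation):
#   ("sym", "a"), ("cat", r, s), ("alt", r, s), ("star", r), ("plus", r)

def desugar(re):
    """Rewrite every ("plus", r) node as ("cat", r, ("star", r))."""
    kind = re[0]
    if kind == "sym":
        return re
    if kind == "plus":
        inner = desugar(re[1])
        return ("cat", inner, ("star", inner))
    if kind == "star":
        return ("star", desugar(re[1]))
    if kind in ("cat", "alt"):
        return (kind, desugar(re[1]), desugar(re[2]))
    raise ValueError(f"unknown node kind: {kind!r}")

# (a|b)+ is rewritten to (a|b)(a|b)*
example = ("plus", ("alt", ("sym", "a"), ("sym", "b")))
core = desugar(example)
```

After this pass the only node kinds left are single symbols, concatenation, alternation, and the Kleene star.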
Next, recall the structure of a deterministic finite state machine.
A finite set of states, $S$.
An input alphabet, $\Sigma$.
A transition function $\delta : S \times \Sigma \to S$.
A subset $F$ of $S$ called the set of final states.
An element $s_0$ of $S$ called the initial state.
Although what I have shown above is the standard way to define an FSA,
it is worth noting that there is one oddly unmathematical aspect of this
definition. The transition function $\delta$ is undefined on many elements in its
domain. When it is helpful, we can eliminate this oddity by adding an
error state with the idea that in any case where $\delta$ would have
been undefined we will define its result to be the error state.
While you are at it, recall (or at least note) that we can explain
the behavior of a deterministic finite state machine by defining
a function $\hat{\delta}$ that extends $\delta$ to strings over the input
alphabet. In particular, we can define
$\hat{\delta} : S \times \Sigma^* \to S$ recursively as
$\hat{\delta}(s, \epsilon) = s$
$\hat{\delta}(s, \omega x) = \delta(\hat{\delta}(s, \omega), x)$
and then state that the language accepted by the machine is
$\{ \omega \in \Sigma^* \mid \hat{\delta}(s_0, \omega) \in F \}$
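As a rough sketch of these definitions, the extended transition function can be computed by feeding the string to the machine one symbol at a time, which is the recursive definition unrolled into a loop. The dictionary representation, the names `delta_hat`, `accepts` and `ERROR`, and the sample machine are assumptions made for this example; the error state is the one suggested above for pairs on which the transition function is undefined.

```python
ERROR = "error"  # hypothetical name for the added error state

def delta_hat(delta, state, string):
    """Extend a single-symbol transition table to whole strings.

    delta maps (state, symbol) pairs to states; any missing pair is
    treated as a transition to ERROR, which can never be left again.
    """
    for symbol in string:
        state = delta.get((state, symbol), ERROR)
    return state

def accepts(delta, start, finals, string):
    """A string is accepted when it drives the start state into a final state."""
    return delta_hat(delta, start, string) in finals

# Sample machine for ab*: s0 --a--> s1 and s1 --b--> s1, with s1 final.
delta = {("s0", "a"): "s1", ("s1", "b"): "s1"}
```

With this machine, `accepts(delta, "s0", {"s1"}, "abbb")` holds, while the string "ba" falls into the error state on its first symbol and is rejected.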
Building a FSA to match a single symbol x is quite easy.
It will have two states (call them $s_0$ and $s_1$).
The initial state will be $s_0$ and $s_1$ will be the only final state.
The transition function will be defined as $\delta(s_0, x) = s_1$.
Building an FSA to match an RE of the form $\alpha\beta$ (i.e. concatenation)
is a bit trickier.
We would like to do the job recursively. That is, start by building FSAs for
$\alpha$ and $\beta$ independently.
Two issues complicate connecting the two machines.
The machine for $\alpha$ may have many final states.
It may not be clear when we should take a transition from one of these
final states to the initial state for $\beta$.
Suppose that $\alpha$ is a* and $\beta$ is aaab*. Given the input aaaa, how many
of the a's should the machine for $\alpha$ consume before we move on to the machine for $\beta$?
We can simplify this issue by giving up one (very key) property of the automaton we have
been building, its deterministic behavior. That is, if we allow ourselves to build a
non-deterministic finite state machine instead of a deterministic one, the construction becomes
much easier.
A non-deterministic FSA may include $\epsilon$-transitions.
We can connect all the final states in the machine for $\alpha$ with
$\epsilon$-transitions to the initial state in the machine for $\beta$.
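A minimal sketch of the single-symbol machine and of this concatenation step follows. The representation is an assumption made for illustration: states are integers, `EPS` (Python's `None`) stands for the empty-string label, and the names `NFA`, `symbol` and `concat` are hypothetical.

```python
import itertools
from collections import namedtuple

# trans maps (state, label) to the set of states reachable on that label.
NFA = namedtuple("NFA", "start finals trans")
EPS = None                  # label used for epsilon-transitions
_fresh = itertools.count()  # source of unused state numbers

def symbol(x):
    """Two-state machine that matches the single symbol x."""
    s0, s1 = next(_fresh), next(_fresh)
    return NFA(s0, {s1}, {(s0, x): {s1}})

def concat(a, b):
    """Machine for a followed by b: every final state of a gets an
    epsilon-transition to the initial state of b."""
    trans = {key: set(targets)
             for m in (a, b) for key, targets in m.trans.items()}
    for f in a.finals:
        trans.setdefault((f, EPS), set()).add(b.start)
    return NFA(a.start, set(b.finals), trans)

ab = concat(symbol("a"), symbol("b"))
```

Because the two sub-machines draw their state numbers from the same counter, their states never collide, so the transition tables can simply be merged.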
So, now consider how to define a non-deterministic finite state machine:
A finite set of states, $S$.
An input alphabet, $\Sigma$.
A transition function $\delta : S \times (\Sigma \cup \{\epsilon\}) \to 2^S$.
A subset $F$ of $S$ called the set of final states.
An element $s_0$ of $S$ called the initial state.
As we did for the FSA, we would like to define an extension of $\delta$ to a function
$\hat{\delta}$ that handles transitions on strings rather than single symbols from $\Sigma$.
Unlike the $\hat{\delta}$ for an FSA, the $\hat{\delta}$ for an NFSA must accept a set
of states and a string as its input:
$\hat{\delta} : 2^S \times (\Sigma \cup \{\epsilon\})^* \to 2^S$
The first step in defining $\hat{\delta}$ is to take care of the $\epsilon$-transitions
by defining the closure of a set of states:
We define the closure of a set of states $P$, closure($P$), to be the smallest set $P'$
such that:
$P' \supseteq P$.
if $s_i \in P'$ and $s_j \in \delta(s_i, \epsilon)$ then $s_j \in P'$.
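Given this definition, we can say
$\hat{\delta}(P, \epsilon) = \mathrm{closure}(P)$
$\hat{\delta}(P, \omega x) = \mathrm{closure}(\delta(\hat{\delta}(P, \omega), x))$.
Here $\delta$ applied to a set of states means the union of $\delta(s, x)$ over all states $s$ in the set.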
Given this definition, we can define the language accepted by an NFSA to be the set
of strings $\omega$ for which $\hat{\delta}(\{s_0\}, \omega)$, the set of states obtained
by applying $\hat{\delta}$ to the machine's initial state and the input, includes at least one
of the machine's final states.
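The closure, the extended transition function, and the acceptance test can be sketched directly from these definitions. As before, the dictionary representation, the use of `None` for the empty-string label, and the function names are assumptions made for the example.

```python
EPS = None  # label used for epsilon-transitions

def closure(trans, states):
    """Smallest superset of `states` that is closed under epsilon-transitions."""
    result = set(states)
    work = list(states)            # states whose epsilon-transitions are unexplored
    while work:
        s = work.pop()
        for t in trans.get((s, EPS), ()):
            if t not in result:
                result.add(t)
                work.append(t)
    return result

def delta_hat(trans, states, string):
    """Set of states reachable from `states` after reading `string`."""
    current = closure(trans, states)
    for x in string:
        moved = set()
        for s in current:          # delta applied to a set: union over its members
            moved |= trans.get((s, x), set())
        current = closure(trans, moved)
    return current

def accepts(trans, start, finals, string):
    """Accept when at least one reachable state is final."""
    return bool(delta_hat(trans, {start}, string) & finals)

# Sample machine for ab: 0 --a--> 1, 1 --eps--> 2, 2 --b--> 3, with 3 final.
trans = {(0, "a"): {1}, (1, EPS): {2}, (2, "b"): {3}}
```

On the sample machine, reading "a" from state 0 yields the set {1, 2}, because the closure step follows the epsilon-transition out of state 1.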
Working with NFSAs also makes it easy to construct a machine for an RE of the form
$\alpha$*.
We add a state that is both an initial and final state.
We add an $\epsilon$-transition from this state to the initial state for $\alpha$.
We add $\epsilon$-transitions from the final states for $\alpha$'s machine
to this new initial state.
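These three steps might be sketched as follows. The representation (integer states, `None` as the empty-string label) and the names `NFA`, `symbol` and `star` are assumptions made for illustration. The sketch also makes the new state the only final state, which is one possible reading of the construction: the old final states reach it on the empty string anyway.

```python
import itertools
from collections import namedtuple

# trans maps (state, label) to the set of states reachable on that label.
NFA = namedtuple("NFA", "start finals trans")
EPS = None                  # label used for epsilon-transitions
_fresh = itertools.count()  # source of unused state numbers

def symbol(x):
    """Two-state machine that matches the single symbol x."""
    s0, s1 = next(_fresh), next(_fresh)
    return NFA(s0, {s1}, {(s0, x): {s1}})

def star(m):
    """Machine for zero or more repetitions of whatever m matches."""
    new = next(_fresh)      # the added state: both initial and final
    trans = {key: set(targets) for key, targets in m.trans.items()}
    # Epsilon-transition from the new state into m's initial state.
    trans.setdefault((new, EPS), set()).add(m.start)
    # Epsilon-transitions from m's final states back to the new state.
    for f in m.finals:
        trans.setdefault((f, EPS), set()).add(new)
    return NFA(new, {new}, trans)

a_star = star(symbol("a"))
```

Since the new state is itself final, the empty string is accepted without taking any transition at all.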
Now we can build a complete NFSA for each RE provided in a Lex input file. This leaves
us with two problems.
We really need a deterministic machine.
We really want one BIG machine rather than a separate machine for each RE.