CS 434 Lecture Notes -- How Lex Works

How Lex Works

Let's use all the good things you learned about regular languages and finite state machines to figure out how a program like Lex works. (We will save figuring out Yacc for after break.)

Lex starts with a collection of regular expressions.

With a bit of luck, you remember that the set of languages described by regular expressions is identical to the set of languages that can be recognized by finite state automata (and the set described by regular grammars).
With this in mind, a reasonable strategy for building a program that recognizes strings that match regular expressions is to build the FSA that corresponds to the regular expressions in a Lex input file.

First, recognize that there are a number of things we can do to reduce the apparent complexity of the language of regular expressions before we even begin.

For example, Lex allows expressions of the form $\alpha^+$. Any such expression can be rewritten as $(\alpha\alpha^*)$. Accordingly, if we figure out how to handle concatenation and *, we don't have to worry about +. We can just rewrite any expression that uses + before we start trying to build an FSA.
Through similar tricks we can get the set of regular expressions we have to worry about down to single symbols from our alphabet, concatenation, alternation ( | ), and the Kleene star (*).
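
As a minimal sketch of this rewriting step, the Python below removes + from an expression tree before any machine is built. The tuple encoding of regular expressions and the name desugar are assumptions made for illustration; they are not taken from Lex itself.

```python
# Regular expressions as nested tuples (a hypothetical encoding, not Lex's own):
#   ("sym", "a")     a single alphabet symbol
#   ("cat", r, s)    concatenation rs
#   ("alt", r, s)    alternation r | s
#   ("star", r)      Kleene star r*
#   ("plus", r)      one or more repetitions, r+

def desugar(re):
    """Rewrite every r+ as r r*, so only sym, cat, alt and star remain."""
    kind = re[0]
    if kind == "sym":
        return re
    if kind == "plus":
        inner = desugar(re[1])
        return ("cat", inner, ("star", inner))
    if kind == "star":
        return ("star", desugar(re[1]))
    if kind in ("cat", "alt"):
        return (kind, desugar(re[1]), desugar(re[2]))
    raise ValueError(f"unknown node kind: {kind!r}")

# a+b becomes (a a*) b
a_plus_b = ("cat", ("plus", ("sym", "a")), ("sym", "b"))
print(desugar(a_plus_b))
```

After this pass, the construction only ever has to handle the four remaining node kinds.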

Next, recall the structure of a deterministic finite state machine.

A finite set of states, $Q$.
An input alphabet, $\Sigma$.
A transition function $\delta : Q \times \Sigma \to Q$.
A subset $F$ of $Q$ called the set of final states.
An element $q_0$ of $Q$ called the initial state.

Although what I have shown above is the standard way to define an FSA, it is worth noting that there is one oddly unmathematical aspect of this definition. The transition function $\delta$ is undefined on many elements in its domain. When it is helpful, we can eliminate this oddity by adding an error state with the idea that in any case where $\delta$ would have been undefined we will define its result to be the error state.

While you are at it, recall (or at least note) that we can explain the behavior of a deterministic finite state machine by defining a function $\hat{\delta}$ that extends $\delta$ to strings over the input alphabet. In particular, we can define $\hat{\delta} : Q \times \Sigma^* \to Q$ recursively as

$\hat{\delta}(q, \epsilon) = q$
$\hat{\delta}(q, \omega x) = \delta(\hat{\delta}(q, \omega), x)$

and then state that the language accepted by the machine is

$\{ \omega \in \Sigma^* \mid \hat{\delta}(q_0, \omega) \in F \}$

Building a FSA to match a single symbol x is quite easy.

It will have two states (call them 0 and 1).
The initial state will be 0 and 1 will be the only final state.
The transition function will be defined as $\delta(0, x) = 1$.
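
The deterministic machine, its extension to strings, and the two-state machine for a single symbol can be sketched in Python as follows. This is only an illustration under stated assumptions: the transition function is stored as a dictionary, and the names ERROR, delta, delta_hat and accepts are made up for this sketch. Missing dictionary entries are sent to the added error state, which makes the transition function total.

```python
ERROR = "error"  # hypothetical name for the added error state

def delta(transitions, state, symbol):
    """Total transition function: undefined entries lead to the error state."""
    return transitions.get((state, symbol), ERROR)

def delta_hat(transitions, state, string):
    """Extend delta to strings: delta_hat(q, wx) = delta(delta_hat(q, w), x)."""
    for symbol in string:  # iterative form of the recursive definition
        state = delta(transitions, state, symbol)
    return state

def accepts(transitions, initial, finals, string):
    """A string is accepted when delta_hat ends in a final state."""
    return delta_hat(transitions, initial, string) in finals

# Machine for the single symbol "x": states 0 and 1, delta(0, x) = 1.
single_x = {(0, "x"): 1}
print(accepts(single_x, 0, {1}, "x"))   # True
print(accepts(single_x, 0, {1}, "xx"))  # False: the second x leads to the error state
print(accepts(single_x, 0, {1}, ""))    # False: state 0 is not final
```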

Building an FSA to match an RE of the form $\alpha\beta$ (i.e. concatenation) is a bit trickier.

We would like to do the job recursively. That is, start by building FSAs for $\alpha$ and $\beta$ independently.
Two issues complicate connecting the two machines.
- The machine for $\alpha$ may have many final states.
- It may not be clear when we should take a transition from one of these final states to the initial state for $\beta$.
  - Suppose that $\alpha$ is $a^*$ and $\beta$ is $aaab^*$.

We can simplify this issue by giving up one (very key) property of the automaton we have been building, its deterministic behavior. That is, if we allow ourselves to build a non-deterministic finite state machine instead of a deterministic one, the construction becomes much easier.

A non-deterministic FSA may include $\epsilon$-transitions.
We can connect all the final states in the machine for $\alpha$ with $\epsilon$-transitions to the initial state in the machine for $\beta$.
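
Here is a rough Python sketch of this connection step. The dictionary layout of a machine, the use of None as the epsilon label, and the helper names symbol_nfa and concat_nfa are all assumptions chosen for illustration. The two machines are assumed to use disjoint state names, which the counter guarantees here.

```python
from itertools import count

EPS = None        # stands for the epsilon label (an assumption of this sketch)
_fresh = count()  # hands out state names that have not been used yet

def symbol_nfa(x):
    """Two-state machine for the single symbol x: start --x--> final."""
    start, final = next(_fresh), next(_fresh)
    return {"start": start, "finals": {final}, "delta": {(start, x): {final}}}

def concat_nfa(m1, m2):
    """Machine for alpha beta, given m1 for alpha and m2 for beta.

    Every final state of m1 gets an epsilon transition to the initial state
    of m2. The result starts where m1 starts and ends where m2 ends.
    """
    delta = {key: set(targets) for key, targets in m1["delta"].items()}
    for key, targets in m2["delta"].items():
        delta.setdefault(key, set()).update(targets)
    for f in m1["finals"]:
        delta.setdefault((f, EPS), set()).add(m2["start"])
    return {"start": m1["start"], "finals": set(m2["finals"]), "delta": delta}

ab = concat_nfa(symbol_nfa("a"), symbol_nfa("b"))
print(ab)
```

Because the transition function maps to sets of states, adding an epsilon edge never conflicts with transitions a final state already has.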

So, now consider how to define a non-deterministic finite state machine:

A finite set of states, $Q$.
An input alphabet, $\Sigma$.
A transition function $\delta : Q \times (\Sigma \cup \{\epsilon\}) \to 2^Q$.
A subset $F$ of $Q$ called the set of final states.
An element $q_0$ of $Q$ called the initial state.

As we did for FSAs, we would like to define an extension of $\delta$ to a function $\hat{\delta}$ that handles transitions on strings rather than single symbols from $\Sigma$.

Unlike the $\hat{\delta}$ for an FSA, the $\hat{\delta}$ for an NFSA must accept a set of states and a string as its input:
$\hat{\delta} : 2^Q \times (\Sigma \cup \{\epsilon\})^* \to 2^Q$
The first step in defining $\hat{\delta}$ is to take care of the $\epsilon$-transitions by defining the closure of a set of states:
We define the closure of a set of states $P$, closure($P$), to be the smallest set $P'$ such that:
- $P' \supseteq P$.
- if $q_i \in P'$ and $q_j \in \delta(q_i, \epsilon)$ then $q_j \in P'$.
Given this definition, we can say
- $\hat{\delta}(P, \epsilon) = \mathrm{closure}(P)$
- $\hat{\delta}(P, \omega x) = \mathrm{closure}(\delta(\hat{\delta}(P, \omega), x))$, where $\delta$ applied to a set of states means the union of $\delta(q, x)$ over every state $q$ in that set.
Given this definition, we can define the language accepted by an NFSA to be the set of strings $\omega$ for which $\hat{\delta}(\{q_0\}, \omega)$ includes at least one of the machine's final states.
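
The closure and the extended transition function translate fairly directly into code. The Python below is a sketch under assumptions: transitions live in a dictionary from (state, label) pairs to sets of states, None stands for the epsilon label, and the function names are invented for this example. The small machine at the end is hand-built: it reads a, takes an epsilon transition, then reads b.

```python
EPS = None  # stands for the epsilon label (an assumption of this sketch)

def closure(delta, states):
    """Smallest superset of `states` closed under epsilon transitions."""
    result = set(states)
    worklist = list(states)
    while worklist:
        q = worklist.pop()
        for r in delta.get((q, EPS), ()):
            if r not in result:
                result.add(r)
                worklist.append(r)
    return result

def step(delta, states, x):
    """delta lifted to sets: the union of delta(q, x) over q in `states`."""
    out = set()
    for q in states:
        out |= delta.get((q, x), set())
    return out

def delta_hat(delta, states, string):
    """delta_hat(P, wx) = closure(delta(delta_hat(P, w), x))."""
    current = closure(delta, states)
    for x in string:
        current = closure(delta, step(delta, current, x))
    return current

def accepts(delta, start, finals, string):
    """Accept when the reachable set contains at least one final state."""
    return bool(delta_hat(delta, {start}, string) & set(finals))

# 0 --a--> 1 --epsilon--> 2 --b--> 3, with 3 the only final state.
nfa = {(0, "a"): {1}, (1, EPS): {2}, (2, "b"): {3}}
print(delta_hat(nfa, {0}, "a"))     # {1, 2}
print(accepts(nfa, 0, {3}, "ab"))   # True
print(accepts(nfa, 0, {3}, "a"))    # False
```

The worklist loop computes the "smallest set" in the definition by adding only states that an epsilon transition forces in.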

Working with NFSAs also makes it easy to construct a machine for an RE of the form $\alpha^*$.

We add a state that is both an initial and final state.
We add an epsilon transition from this state to the initial state for $\alpha$.
We add epsilon transitions from the final states for $\alpha$'s machine to this new initial state.
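
The three steps above can be sketched in Python as follows. As before, the dictionary layout of a machine, the use of None for the epsilon label, and the name star_nfa are assumptions made for illustration, and the caller is assumed to supply a state name that the machine does not already use.

```python
EPS = None  # stands for the epsilon label (an assumption of this sketch)

def star_nfa(m, new_state):
    """Machine for alpha*, given a machine m for alpha.

    `new_state` must be a state name that m does not already use.
    """
    delta = {key: set(targets) for key, targets in m["delta"].items()}
    # Epsilon transition from the new state to the old initial state.
    delta.setdefault((new_state, EPS), set()).add(m["start"])
    # Epsilon transitions from the old final states back to the new state.
    for f in m["finals"]:
        delta.setdefault((f, EPS), set()).add(new_state)
    # The new state is both initial and final, so the empty string is accepted.
    # The old final states stay final in this sketch; they also reach the
    # new state by an epsilon transition, so either choice gives alpha*.
    finals = set(m["finals"]) | {new_state}
    return {"start": new_state, "finals": finals, "delta": delta}

# Machine for the single symbol "a": 0 --a--> 1.
a_machine = {"start": 0, "finals": {1}, "delta": {(0, "a"): {1}}}
a_star = star_nfa(a_machine, 2)
print(a_star)
```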

Now we can build a complete NFSA for each RE provided in a Lex input file. This leaves us with two problems.

We really need a deterministic machine.
We really want one BIG machine rather than a separate machine for each RE.

To be continued.... ?????
