Lecture 10

CS 131 - Programming Languages (Lecture 10: Lexical Analysis and Regular Expressions)

LOW LEVEL PARSING: LEXICAL ANALYSIS

Recall that when I showed you a BNF for a fragment of C a couple of lectures back, I said that the rules specify the legal ways to form a member of a given syntactic category in terms of certain literal strings and other syntactic categories. For example, we had the rule:

fundef ::= [type_id] id ( [param_list] ) block

We said that some of these categories, like param_list, would be further defined in the BNF, but that others, like id and type_id, would not be. This is because we don't care, at this level, what the rules are for forming variable and type names. To put it another way, spelling and grammar are different. We don't say that an English sentence with misspelled words is ungrammatical.

The categories for which we don't write BNF rules are called the terminal symbols or lexical tokens of the language. The notation used to describe the spelling rules of a language, and the methods used to check whether a given string of characters is a valid token, are quite different from those used for the grammar.

For this reason the process of parsing a string of characters to determine whether it belongs to a given language is usually divided into two separate phases, lexical analysis and syntactic analysis, sometimes referred to as tokenizing and parsing. (Note that this means that the term "parsing" is a bit ambiguous.) The lexical analyzer breaks the input into a stream of tokens of different classes, which the parser then takes as input.

The rest of this lecture and the first half of the next will focus on the mechanics of lexical analysis.

The syntax of lexical tokens is specified using a mathematical notation known as regular expressions. You may already have used variants of regular expressions in searching for text within emacs, or in Unix using grep (whose name comes from the ed editor command g/re/p, "globally search for a regular expression and print").

THE SYNTAX OF REGULAR EXPRESSIONS

The rules for forming regular expressions begin by specifying the overall set of characters allowed to occur in tokens. This set, historically denoted by the Greek letter Σ, is called the alphabet of your language. Any set of strings constructed of characters from the alphabet is called a language over Σ. Regular expressions are used to denote the particular language over Σ we are interested in.

For a given alphabet Σ, regular expressions are formed according to the following rules:

The symbol ∅ is a regular expression, as is the symbol ε.
Each letter in Σ is a regular expression.
If r₁ and r₂ are regular expressions, then (r₁ + r₂) is a regular expression. In some books this is written (r₁ | r₂).
If r₁ and r₂ are regular expressions, then (r₁ · r₂) is a regular expression.
If r is a regular expression, then r* is a regular expression.
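The five formation rules can be read as the constructors of a small tree data type. The following is a minimal sketch in Python, writing ∅ for the empty-set symbol, ε for the empty-string symbol, and · for concatenation. The class names and the `show` helper are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


class Regex:
    """Base class for regular expressions over some alphabet."""


@dataclass(frozen=True)
class Empty(Regex):      # the symbol ∅
    pass


@dataclass(frozen=True)
class Epsilon(Regex):    # the symbol ε
    pass


@dataclass(frozen=True)
class Letter(Regex):     # a single letter of the alphabet
    char: str


@dataclass(frozen=True)
class Union(Regex):      # (r1 + r2)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Concat(Regex):     # (r1 · r2)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):       # r*
    body: Regex


def show(r: Regex) -> str:
    """Render a regex tree in the fully parenthesized notation of the rules."""
    if isinstance(r, Empty):
        return "∅"
    if isinstance(r, Epsilon):
        return "ε"
    if isinstance(r, Letter):
        return r.char
    if isinstance(r, Union):
        return f"({show(r.left)} + {show(r.right)})"
    if isinstance(r, Concat):
        return f"({show(r.left)} · {show(r.right)})"
    if isinstance(r, Star):
        return f"{show(r.body)}*"
    raise TypeError(f"not a regular expression: {r!r}")


# Built bottom-up, one formation rule per constructor call.
example = Star(Union(Letter("a"), Concat(Letter("b"), Epsilon())))
```

Here `show(example)` renders the tree as `(a + (b · ε))*`. Every value of type `Regex` is well formed by construction, which is all the formation rules guarantee; they say nothing yet about which strings the expression stands for.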

THE MEANING OF REGULAR EXPRESSIONS

Every regular expression represents a language over the alphabet Σ. But which language? What is the relationship between regular expressions and languages? The meaning of a regular expression is determined by a set of rules that mirror the rules for the construction of regular expressions:

The symbol ∅ represents the empty set of strings. The symbol ε represents the singleton set of the empty string. (There is a difference!)
Each letter in Σ represents the singleton set of the one-character string corresponding to that letter.
The regular expression (r₁ + r₂) denotes the set of strings that has all the strings in the set denoted by r₁ and all the strings from the set denoted by r₂. That is, it is the union of those two languages.
The regular expression (r₁ · r₂) represents the set of strings formed by taking any string from the language denoted by r₁ and concatenating some string from the language denoted by r₂.
The regular expression r* is the set of strings formed by concatenating together zero or more strings from the language denoted by r. That is, it is equivalent to (ε + r + (r · r) + (r · r · r) + ...).
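The meaning rules above translate almost line for line into a recursive function. The sketch below is one possible rendering in Python, not a standard algorithm: because r* usually denotes an infinite set, it computes only the strings of length at most `max_len`. The nested-tuple encoding of expressions and the name `language` are assumptions made for this example.

```python
def language(r, max_len):
    """Return the strings of length <= max_len in the language denoted by r.

    r uses a hypothetical nested-tuple encoding:
      ("empty",)          the empty set of strings
      ("eps",)            the set containing only the empty string
      ("char", c)         the set containing the one-character string c
      ("union", r1, r2)   union of two languages
      ("concat", r1, r2)  concatenation of two languages
      ("star", r1)        zero or more concatenations
    """
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "char":
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        base = language(r[1], max_len)
        result = {""}          # zero concatenations
        frontier = {""}
        while frontier:        # keep appending one more string from base
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression: {r!r}")


a = ("char", "a")
b = ("char", "b")
all_short = language(("star", ("union", a, b)), 2)
```

The difference between the two base symbols shows up directly: `language(("empty",), 3)` is the empty set, while `language(("eps",), 3)` is the set containing the empty string. Concatenating anything with the empty set yields the empty set, but the star of the empty set still contains the empty string.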

This last operator is often called "Kleene star" in honor of Stephen Kleene, a logician who was the first to use this notation for writing regular expressions and who made many contributions to theoretical computer science.

It is common to extend the language of regular expressions with the term r⁺ (with the plus superscripted), which denotes the set of strings formed by concatenating one or more strings from the language denoted by r. This is just syntactic sugar, since r⁺ = (r · r*). Because r⁺ is convenient to have, I will shift to using | for union from here on, to avoid confusing the two uses of +.
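The claim that r⁺ adds no expressive power can be spot-checked mechanically. The sketch below uses Python's `re` module, whose syntax is a variant of the notation here (concatenation is written by juxtaposition and union by |). It takes r to be the two-character string ab and compares r⁺ against r followed by r* on every string over {a, b} up to length 6. This is a check on a small finite sample, not a proof, and the helper `strings_up_to` is a hypothetical name.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


plus_form = re.compile(r"(?:ab)+")          # r⁺ with r = ab
star_form = re.compile(r"(?:ab)(?:ab)*")    # r followed by r*

# fullmatch asks whether the whole string is in the language.
agree = all(
    bool(plus_form.fullmatch(s)) == bool(star_form.fullmatch(s))
    for s in strings_up_to("ab", 6)
)
accepted = [s for s in strings_up_to("ab", 6) if plus_form.fullmatch(s)]
```

On this sample the two forms agree everywhere, and the accepted strings are `ab`, `abab`, and `ababab`. The empty string is rejected by both, which is exactly what separates r⁺ from r*.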

SOME EXAMPLES OF REGULAR EXPRESSIONS

Let's look at a few examples of regular expressions for particular token classes (I will put the object characters in single quotes to avoid ambiguity):

Binary numbers of arbitrary length:
Real numbers requiring at least one digit on each side of the decimal point, with an optional sign and an optional, optionally signed exponent (I will use d for the set of digits; it is equivalent to ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')):
SML identifiers, which must begin with a letter, and then may have any string of letters, digits, underscores, and primes (I will use l for the set of letters and d for the set of digits):
Ada identifiers, which must begin with a letter and then may have any string of letters, digits and underscores, with the proviso that underscores may only occur one at a time and cannot be the last character:
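As a sketch, these four token classes could be written as follows in Python's `re` syntax, with one possible rendering in the lecture notation in the comment above each pattern. Several details are assumptions that the descriptions do not pin down: letters are taken to be the ASCII letters, the exponent marker is taken to be 'e' or 'E', and a binary number is taken to need at least one digit. The names `BINARY`, `REAL`, `SML_ID`, `ADA_ID`, and `is_token` are hypothetical.

```python
import re

# ('0' | '1')⁺
BINARY = re.compile(r"[01]+")

# ('+' | '-' | ε) d⁺ '.' d⁺ (ε | ('e' | 'E') ('+' | '-' | ε) d⁺)
REAL = re.compile(r"[+-]?[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?")

# l (l | d | '_' | ''')*
SML_ID = re.compile(r"[A-Za-z][A-Za-z0-9_']*")

# l (l | d)* ('_' (l | d)⁺)*
# Each underscore must be followed by at least one letter or digit,
# so underscores cannot be doubled and cannot end the identifier.
ADA_ID = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*")


def is_token(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return pattern.fullmatch(text) is not None
```

For instance, `is_token(ADA_ID, "max_value")` holds, while `is_token(ADA_ID, "max__value")` and `is_token(ADA_ID, "max_")` do not. `is_token(SML_ID, "x'")` holds because primes are allowed after the first letter, and `is_token(REAL, "3.")` fails because a digit is required on each side of the decimal point.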

One regular expression describes the language of one class of tokens. Thus a lexical analyzer takes a string of characters (perhaps a whitespace-delimited string, for example) and, for each token, determines which, if any, of the token classes it has been told to look for contains that token. It is this information that is passed to the parser.
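A toy version of this classification step might look like the sketch below. It assumes whitespace-delimited input, as the paragraph suggests, and a small table of token classes; the class names, the patterns, and the functions `classify` and `tokenize` are all hypothetical. When more than one class could match, this sketch simply reports the first one listed, which is a design choice of the example rather than a rule from the lecture.

```python
import re

# One regular expression per token class, tried in this order.
TOKEN_CLASSES = [
    ("BINARY", re.compile(r"[01]+")),
    ("REAL", re.compile(r"[+-]?[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?")),
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9_']*")),
]


def classify(lexeme):
    """Return the name of the first class whose language contains lexeme,
    or None if it belongs to none of them."""
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(lexeme):
            return name
    return None


def tokenize(text):
    """Split text on whitespace and pair each piece with its token class."""
    return [(classify(lexeme), lexeme) for lexeme in text.split()]


stream = tokenize("x1 1011 3.5 @")
```

Here `stream` is `[("IDENT", "x1"), ("BINARY", "1011"), ("REAL", "3.5"), (None, "@")]`. The parser would consume the class names and would treat the `None` entry as a lexical error.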