Lecture 11

Harvey Mudd College
Computer Science 131
Programming Languages
Spring Semester 1999

Lecture 11 (3/3/99)

Concrete Syntax vs. Abstract Syntax for Project 4
Introduction to Parsing

I want to spend a few minutes talking to you about the syntax of the languages we are writing interpreters for. These languages will all use a variant of the syntax of Lisp and Scheme. Lisp and Scheme use a very simple syntax, called s-expression syntax. The advantage of this syntax is that there is almost nothing to it: there are no special symbols used for marking blocks and the like -- everything is done with parentheses; there are no infix operators -- everything is written in prefix notation.

So, for example, the factorial function is written:

(defun (fact n)
   (if (= n 0)
       1
       (* n (fact (- n 1)))))

While it looks like an s-expression is either an atom (like "if" or "1") or a list of s-expressions in parentheses, the full syntax is a bit more general. An s-expression is either an atom (as just described, but also including the atom named "nil", also written "()") or a pair of s-expressions in parentheses, separated by a dot. In addition, the following simplification rule can be applied: if a dot precedes a left parenthesis, then the dot, the left parenthesis, and its matching right parenthesis are deleted.

Thus, (= n 0) is actually an abbreviation for (= . (n . (0 . ()))).

Unlike in ML, where the right-hand argument of :: must be a list, here the right-hand side of the dot need not be a list: either side of the dot can be an arbitrary s-expression. This is essentially the same as the situation with the bracket-and-bar notation for lists in rex. Any time a | occurs before a pair of matching square brackets, rex deletes all three and puts a comma in place of the bar.

Ordinarily, the first stage of an interpreter is the parser, which converts the text of the object program into a representation in abstract syntax. However, while I did not want you to have to deal with the ins and outs of developing a parser in ML and examining the input strings of characters directly, I did want you to have some feel for the process of converting from concrete to abstract syntax. Therefore the parser I wrote does the bare minimum. Essentially, all it does is tokenize the input (more on that below) and recognize the nesting structure of the sexpressions. It then returns you an ML value corresponding closely to the input. This value is an element of the datatype:

    datatype sexp =
        Atom of atom
      | $ of (sexp * sexp)
 
    and atom =
        Nil
      | Symbol of string
      | Number of number
 
    and number =
        Int of int
      | Real of real

I refer to this datatype as an intermediate syntax, since it is neither concrete, nor abstract, though it is closer to the former. In fact it is the abstract syntax of s-expressions, but not of the language we are building.
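
To see how the dot notation and its abbreviation relate to this datatype, here is a minimal sketch of two printers. It assumes that $ is declared right-associative (the infixr declaration and its precedence are my assumption; the sample output in these notes only shows that $ is used infix), and the function names atomToString, dotted, abbreviated and tail are hypothetical, not part of the project code.

```sml
datatype sexp =
    Atom of atom
  | $ of (sexp * sexp)
and atom =
    Nil
  | Symbol of string
  | Number of number
and number =
    Int of int
  | Real of real

(* Assumption: $ is right-associative, like :: on ML lists. *)
infixr 5 $

fun atomToString Nil = "()"
  | atomToString (Symbol s) = s
  | atomToString (Number (Int i)) = Int.toString i
  | atomToString (Number (Real r)) = Real.toString r

(* Full form: every pair is written with its dot and parentheses. *)
fun dotted (Atom a) = atomToString a
  | dotted (l $ r) = "(" ^ dotted l ^ " . " ^ dotted r ^ ")"

(* Abbreviated form: when the right-hand side of a dot would start
   with a left parenthesis, the dot and that pair of parentheses
   are dropped. *)
fun abbreviated (Atom a) = atomToString a
  | abbreviated (l $ r) = "(" ^ abbreviated l ^ tail r ^ ")"
and tail (Atom Nil) = ""
  | tail (Atom a) = " . " ^ atomToString a
  | tail (l $ r) = " " ^ abbreviated l ^ tail r

(* The test from the factorial example, built by hand. *)
val eqTest =
  Atom (Symbol "=") $ Atom (Symbol "n") $ Atom (Number (Int 0)) $ Atom Nil

(* A pair whose right-hand side is not a list. *)
val improper = Atom (Symbol "a") $ Atom (Symbol "b")
```

With these definitions, dotted eqTest gives "(= . (n . (0 . ())))" and abbreviated eqTest gives "(= n 0)", while improper prints as "(a . b)" in both forms because its right-hand side is an atom other than nil.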

To parse an input string, you call the function SexpParser.parse_string_for_sexp that is loaded as part of the new binary sml-sexp. For example:

- SexpParser.parse_string_for_sexp "(* 3 4)";
val it =
   Atom (Symbol "*") $ Atom (Number (Int 3)) $
   Atom (Number (Int 4)) $ Atom Nil
: SexpParser.sexp
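
To give a feel for what "tokenize the input and recognize the nesting structure" might involve, here is a rough sketch of such a parser. It is not the code behind SexpParser.parse_string_for_sexp; the names token, tokenize, classify, parseSexp, parseTail, parse and Syntax are all hypothetical. It also makes simplifying assumptions: it handles only parenthesized lists (no dot notation), and it treats a token as a number only if it consists entirely of digits, so negative numbers and reals come out as symbols.

```sml
datatype sexp =
    Atom of atom
  | $ of (sexp * sexp)
and atom =
    Nil
  | Symbol of string
  | Number of number
and number =
    Int of int
  | Real of real

(* Assumption: $ is right-associative. *)
infixr 5 $

datatype token = LParen | RParen | Word of string

exception Syntax of string

(* Split the input into parentheses and maximal runs of other
   non-blank characters. Tokens are accumulated in reverse. *)
fun tokenize s =
  let
    fun flush ([], toks) = toks
      | flush (cs, toks) = Word (implode (rev cs)) :: toks
    fun loop ([], cur, toks) = rev (flush (cur, toks))
      | loop (c :: cs, cur, toks) =
          if c = #"(" then loop (cs, [], LParen :: flush (cur, toks))
          else if c = #")" then loop (cs, [], RParen :: flush (cur, toks))
          else if Char.isSpace c then loop (cs, [], flush (cur, toks))
          else loop (cs, c :: cur, toks)
  in
    loop (explode s, [], [])
  end

(* A word made only of digits is an integer; anything else is a symbol. *)
fun classify w =
  if List.all Char.isDigit (explode w)
  then (case Int.fromString w of
            SOME i => Number (Int i)
          | NONE => Symbol w)
  else Symbol w

(* parseSexp reads one s-expression and returns it together with the
   unread tokens; parseTail reads list elements up to the closing
   parenthesis and chains them with $, ending in Atom Nil. *)
fun parseSexp (Word w :: rest) = (Atom (classify w), rest)
  | parseSexp (LParen :: rest) = parseTail rest
  | parseSexp (RParen :: _) = raise Syntax "unexpected right parenthesis"
  | parseSexp [] = raise Syntax "unexpected end of input"
and parseTail (RParen :: rest) = (Atom Nil, rest)
  | parseTail toks =
      let
        val (first, rest) = parseSexp toks
        val (others, rest') = parseTail rest
      in
        (first $ others, rest')
      end

fun parse s =
  case parseSexp (tokenize s) of
      (e, []) => e
    | _ => raise Syntax "extra input after s-expression"
```

Under these assumptions, parse "(* 3 4)" builds the same nested value as the transcript above.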

While we could send this value directly to the interpreter, we have already discussed the advantage of a more abstract syntax. What is this expression really? It is an operator, "*", applied to two operands, "3" and "4". It is your job, as the first step of each phase of the project, to write a function that translates programs in the intermediate syntax to abstract syntax. For the full language, as constructed in phase 3 of the project, the abstract syntax of "calculator expressions" is given by the types:

    datatype vname = Vname of string
    datatype operator = Operator of string
 
    datatype cexpression =
        Cdata of number
      | Cvar of vname
      | Cappl of operator * cexpression list

So, for example:

- SexpParser.parse_string_for_sexp "(* 3 4)";
val it =
   Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil
  : SexpParser.sexp
 
- SyntaxConverter.convert_expression it;
val it = Cappl (Operator "*",[Cdata (Int 3),Cdata (Int 4)])
  : AbstractSyntax.cexpression
 
 
- SexpParser.parse_string_for_sexp "(+ (* 3 4) 2)";
val it =
  Atom (Symbol "+") $
  (Atom (Symbol "*") $
   Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil) $
  Atom (Number (Int 2)) $ Atom Nil : SexpParser.sexp
 
- SyntaxConverter.convert_expression it;
val it =
  Cappl
    (Operator "+",
     [Cappl (Operator "*",[Cdata (Int 3),Cdata (Int 4)]),
      Cdata (Int 2)])
  : AbstractSyntax.cexpression
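
The translation illustrated above can be sketched as a pair of mutually recursive functions over the intermediate syntax. This is only one possible shape for it, not the SyntaxConverter used in the transcripts: the names convertExpression, convertArgs and BadSyntax are hypothetical, the infixr declaration for $ is an assumption, and the choice to reject an application whose head is not a symbol is mine.

```sml
datatype sexp =
    Atom of atom
  | $ of (sexp * sexp)
and atom =
    Nil
  | Symbol of string
  | Number of number
and number =
    Int of int
  | Real of real

(* Assumption: $ is right-associative. *)
infixr 5 $

datatype vname = Vname of string
datatype operator = Operator of string

datatype cexpression =
    Cdata of number
  | Cvar of vname
  | Cappl of operator * cexpression list

exception BadSyntax of string

(* A number is data, a bare symbol is a variable, and a list whose
   head is a symbol is an operator applied to converted operands. *)
fun convertExpression (Atom (Number n)) = Cdata n
  | convertExpression (Atom (Symbol s)) = Cvar (Vname s)
  | convertExpression (Atom Nil) = raise BadSyntax "empty expression"
  | convertExpression (Atom (Symbol f) $ args) =
      Cappl (Operator f, convertArgs args)
  | convertExpression (_ $ _) = raise BadSyntax "operator must be a symbol"

(* Walk the chain of $ pairs, which must end in Atom Nil. *)
and convertArgs (Atom Nil) = []
  | convertArgs (first $ rest) = convertExpression first :: convertArgs rest
  | convertArgs (Atom _) = raise BadSyntax "operands do not form a proper list"

(* The intermediate syntax of the second example, built by hand. *)
val sample =
  Atom (Symbol "+") $
  (Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil) $
  Atom (Number (Int 2)) $ Atom Nil

val converted = convertExpression sample
```

Note that the improper-list case matters here: because either side of a dot can be an arbitrary s-expression, the operand chain may end in an atom other than nil, and the converter has to decide what to do about it.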

Similarly, there is an abstract syntax data type for the commands of the language:

 
    datatype ccommand =
        Cexpression of cexpression
      | Cdefine of vname * cexpression
      | Cdefun of operator * vname list * cexpression
      | Cbindings
      | Cexit

- SexpParser.parse_string_for_sexp "(define x (+ (* 3 4) 2))";
val it =
  Atom (Symbol "define") $
  Atom (Symbol "x") $
  (Atom (Symbol "+") $
   (Atom (Symbol "*") $
    Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil) $
   Atom (Number (Int 2)) $ Atom Nil) $ Atom Nil : SexpParser.sexp
 
- SyntaxConverter.convert_command it;
val it =
  Cdefine
    (Vname "x",
     Cappl
       (Operator "+",
        [Cappl (Operator "*",[Cdata (Int 3),Cdata (Int 4)]),
         Cdata (Int 2)]))
  : AbstractSyntax.ccommand
 

- SexpParser.parse_string_for_sexp "(defun (f x y) (+ (* 3 x) y))";
val it =
  Atom (Symbol "defun") $
  (Atom (Symbol "f") $ Atom (Symbol "x") $ Atom (Symbol "y") $ Atom Nil) $
  (Atom (Symbol "+") $
   (Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Symbol "x") $ Atom Nil) $
   Atom (Symbol "y") $ Atom Nil) $ Atom Nil : SexpParser.sexp
 
- SyntaxConverter.convert_command it;
val it =
  Cdefun
    (Operator "f",[Vname "x",Vname "y"],
     Cappl
       (Operator "+",
        [Cappl (Operator "*",[Cdata (Int 3),
                              Cvar (Vname "x")]),
         Cvar (Vname "y")])) : AbstractSyntax.ccommand
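
The command examples above suggest a converter that pattern matches on the leading keyword. Here is a sketch along those lines, again with hypothetical names (convertCommand, convertParams, BadSyntax) and an assumed infixr declaration for $. Since these notes do not show the concrete forms for Cbindings and Cexit, the sketch leaves them out; it also makes the arbitrary choice that anything not matching the define or defun shapes is treated as an expression.

```sml
datatype sexp =
    Atom of atom
  | $ of (sexp * sexp)
and atom =
    Nil
  | Symbol of string
  | Number of number
and number =
    Int of int
  | Real of real

(* Assumption: $ is right-associative. *)
infixr 5 $

datatype vname = Vname of string
datatype operator = Operator of string

datatype cexpression =
    Cdata of number
  | Cvar of vname
  | Cappl of operator * cexpression list

datatype ccommand =
    Cexpression of cexpression
  | Cdefine of vname * cexpression
  | Cdefun of operator * vname list * cexpression
  | Cbindings
  | Cexit

exception BadSyntax of string

fun convertExpression (Atom (Number n)) = Cdata n
  | convertExpression (Atom (Symbol s)) = Cvar (Vname s)
  | convertExpression (Atom Nil) = raise BadSyntax "empty expression"
  | convertExpression (Atom (Symbol f) $ args) =
      Cappl (Operator f, convertArgs args)
  | convertExpression (_ $ _) = raise BadSyntax "operator must be a symbol"
and convertArgs (Atom Nil) = []
  | convertArgs (first $ rest) = convertExpression first :: convertArgs rest
  | convertArgs (Atom _) = raise BadSyntax "operands do not form a proper list"

(* Formal parameters must be a proper list of symbols. *)
fun convertParams (Atom Nil) = []
  | convertParams (Atom (Symbol p) $ rest) = Vname p :: convertParams rest
  | convertParams _ = raise BadSyntax "parameters must be symbols"

(* define takes a name and a body; defun takes a list holding the
   function name and its parameters, followed by a body. *)
fun convertCommand (Atom (Symbol "define") $ Atom (Symbol x) $ body $ Atom Nil) =
      Cdefine (Vname x, convertExpression body)
  | convertCommand (Atom (Symbol "defun") $ (Atom (Symbol f) $ params) $ body $ Atom Nil) =
      Cdefun (Operator f, convertParams params, convertExpression body)
  | convertCommand other = Cexpression (convertExpression other)

(* The defun example from above, built by hand. *)
val defunSample =
  Atom (Symbol "defun") $
  (Atom (Symbol "f") $ Atom (Symbol "x") $ Atom (Symbol "y") $ Atom Nil) $
  (Atom (Symbol "+") $
   (Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Symbol "x") $ Atom Nil) $
   Atom (Symbol "y") $ Atom Nil) $ Atom Nil

val defunConverted = convertCommand defunSample
```

One consequence of the fall-through clause is that a malformed define or defun is converted as an ordinary application of an operator with that name; a stricter converter might raise an error instead.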

	This page copyright ©1999 by Joshua S. Hodas. It was built on a Macintosh. Last rebuilt on Wednesday, March 3, 1999 at 2:00 PM.
http://cs.hmc.edu/~hodas/courses/cs131/lectures/lecture11.html
