Lecture 10

Harvey Mudd College
Computer Science 131
Programming Languages
Spring Semester 2000

Concrete Syntax vs. Abstract Syntax
Concrete Syntax vs. Abstract Syntax for Project 4
Specifying Syntax

As we discussed last lecture, when studying a programming language one must first distinguish between the language's syntax, the rules that determine which strings of symbols are legitimate programs in the language, and its semantics, the rules that determine the meaning of a given string of symbols.

We spent the last lecture talking about semantics and, in general, it is the semantics that researchers are interested in. Which grammar rules are chosen for a language is rarely very important. So, when presenting examples, a programming languages researcher will often pick some artificial syntax that is generic to the class of languages she is making a point about. She will not, for example, use C or Pascal syntax, but rather some generic block-structured imperative-language syntax.

Nevertheless, to build an interpreter or compiler for a language, we must be in a position to analyze its grammar. Therefore, for the next three or so lectures, we will be discussing the problem of parsing: analyzing a string of symbols to see whether it belongs to the language we are interested in, and converting it to some internal, more manageable form that we can hand to the interpreter or code generator.

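The most common way to specify the grammar of a language is in what is known as Backus-Naur Form, or BNF. There are several different styles for writing BNF rules. The one used here is common: quoted strings are literal symbols, square brackets enclose optional items, and a vertical bar separates alternatives.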

A partial BNF for ANSI C functions might be given as:

    fundef ::= [type_id] id "(" [param_list] ")" block
 
    block ::= "{" [local_list] stat_list "}"
 
    param_list ::= [type_id] id
                 | [type_id] id "," param_list
 
    local_list ::= type_id id_list ";" [local_list]
 
    id_list ::= id 
              | id "," id_list
 
    stat_list ::= statement ";" [stat_list]
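To give a feel for what rules like these mean operationally, here is a minimal sketch in ML that reads the id_list and local_list rules as recursive functions over a list of tokens. The token datatype and the function names idList and localList are made up for this illustration, and the sketch assumes the input has already been split into tokens. It is one simple way to read the rules, not the parsing technique we will develop in the coming lectures.

```sml
(* Hypothetical token type: the input is assumed to be tokenized already. *)
datatype token = TypeId of string | Ident of string | Comma | Semi

(* id_list ::= id | id "," id_list
   Returns the identifiers read together with the unconsumed tokens,
   or NONE if the tokens do not start with an id_list. *)
fun idList (Ident x :: Comma :: rest) =
      (case idList rest of
           SOME (xs, rest') => SOME (x :: xs, rest')
         | NONE => NONE)
  | idList (Ident x :: rest) = SOME ([x], rest)
  | idList _ = NONE

(* local_list ::= type_id id_list ";" [local_list]
   Each declaration is returned as a (type name, identifiers) pair. *)
fun localList (TypeId t :: rest) =
      (case idList rest of
           SOME (xs, Semi :: rest') =>
             let val decl = (t, xs)
             in case localList rest' of
                    SOME (ds, rest'') => SOME (decl :: ds, rest'')
                  | NONE => SOME ([decl], rest')   (* optional tail absent *)
             end
         | _ => NONE)
  | localList _ = NONE

(* The tokens of:  int x, y; char c;  *)
val example =
  localList [TypeId "int", Ident "x", Comma, Ident "y", Semi,
             TypeId "char", Ident "c", Semi]
```

Each alternative of a rule becomes a clause, and each optional item becomes a case that may fail without failing the whole rule.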

We will discuss how to build and analyze grammars starting next class. For the rest of this lecture I want to discuss the difference between concrete syntax and abstract syntax and how they relate to your next assignment. Consider the following two functions:

fun fact x = if (x = 0) 
               then 1
               else x * (fact (x - 1));

function swap( var x:integer, var y:integer) : integer;

var temp:integer;

begin
   temp := x;
   x := y;
   y := temp;
end;

Does everyone recognize these languages and understand what the functions do?

Well, whatever you're thinking, you're almost certainly wrong. When I wrote them I had in mind a rex program and a C program, respectively, and I argue that that is exactly what they are for any purpose that you could really care about.

The BNF for C functions above is concrete syntax: it is written in terms of the actual strings of symbols that we use to distinguish the parts of a program. Clearly, though, at the level of the interpreter or code generator we are not interested in whether C uses braces or begin/end pairs for delimiting blocks. We can describe the structure of a function much more analytically as:

    function = Function(name : identifier, return : type, 
                        parameters : var list, body : block)
     
    var = Var(name : identifier, type : type)
 
    block = Block(locals : var list, code : statement list)

Notice that this abstracts away many details. For instance, we no longer care that the parameter list for a function and the list of local variables are defined with different syntaxes.

The beauty of this idea (or perhaps the beauty of ML) is that abstract syntax can be mimicked directly in ML datatypes, and ML programs can easily be built to analyze these structures.

So, for example, the C function:

int foo(int x)
{
   x++;
   return x;
}

could be represented by the ML value:

val foo = Function(Id "foo", Type "int", [Var (Id "x", Type "int")],
                   Block([],[Inc(Var (Id "x")),Return(Var (Id "x"))]));
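To make this concrete, here is a minimal sketch of ML datatypes under which a value like foo type-checks. The type names identifier, ty, var, block and fundef, and the constructors Id, Type, Inc and Return, are guesses based on the value above; ty stands in for type, which is a reserved word in ML. A single ML constructor cannot take both an (identifier, type) pair and a lone identifier, so the sketch assumes a separate constructor, here called VarRef, for uses of a variable inside statements.

```sml
(* Hypothetical datatypes mirroring the abstract syntax of functions.
   All names are assumptions made for this illustration. *)
datatype identifier = Id of string
datatype ty = Type of string                (* "type" itself is reserved *)

datatype var = Var of identifier * ty       (* a declared variable *)

datatype expression = VarRef of identifier  (* a use of a variable *)

datatype statement =
    Inc of expression
  | Return of expression

datatype block = Block of var list * statement list

datatype fundef = Function of identifier * ty * var list * block

(* int foo(int x) { x++; return x; } *)
val foo =
  Function (Id "foo", Type "int", [Var (Id "x", Type "int")],
            Block ([], [Inc (VarRef (Id "x")), Return (VarRef (Id "x"))]))

(* A small analysis by pattern matching: the names of the parameters. *)
fun paramNames (Function (_, _, params, _)) =
  map (fn Var (Id name, _) => name) params

val names = paramNames foo
```

Note that nothing in paramNames depends on how the parameter list was punctuated in the source text; that is exactly the detail the abstract syntax has discarded.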

I want to spend a few minutes talking to you about the syntax of the languages we are writing interpreters for. These languages will all use a variant of the syntax of the languages Lisp and Scheme. These languages have adopted a very simple syntax, called s-expression syntax. The advantage of this syntax is that there is almost nothing to it: there are no special symbols used for marking blocks and the like -- everything is done with parentheses; there are no infix operators -- everything is written in prefix notation.

So, for example, the factorial function is written:

(defun (fact n)
   (if (= n 0)
       1
       (* n (fact (- n 1)))))

While it looks as though an s-expression is either an atom (like "if" or "1") or a list of s-expressions in parentheses, the full syntax is a bit more general. An s-expression is either an atom (as just described, but also including the atom named "nil", also written "()") or a pair of s-expressions in parentheses, separated by a dot. In addition, the following simplification rule can be applied: if a dot precedes a left parenthesis, then the dot, the left parenthesis, and its matching right parenthesis are deleted.

Thus, (= n 0) is actually an abbreviation for (= . (n . (0 . ()))).

Unlike in ML, it is possible for the right-hand argument not to be a list: either side of the dot can be an arbitrary s-expression. This is essentially the same as the situation with the bracket-and-bar notation for lists in rex: any time a | occurs before a pair of square brackets, rex deletes all three and puts a comma in place of the bar.

Ordinarily, the first stage of an interpreter is the parser, which converts the text of the object program into a representation in abstract syntax. However, while I did not want you to have to deal with the ins and outs of developing a parser in ML and examining the input strings of characters directly, I did want you to have some feel for the process of converting from concrete to abstract syntax. Therefore the parser I wrote does the bare minimum. Essentially, all it does is tokenize the input (more on that below) and recognize the nesting structure of the sexpressions. It then returns you an ML value corresponding closely to the input. This value is an element of the datatype:

    datatype sexp =
        Atom of atom
      | $ of (sexp * sexp)
 
    and atom =
        Nil
      | Symbol of string
      | Number of number
 
    and number =
        Int of int
      | Real of real

I refer to this datatype as an intermediate syntax, since it is neither concrete, nor abstract, though it is closer to the former. In fact it is the abstract syntax of s-expressions, but not of the language we are building.
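The connection between this datatype and the dot rule described earlier can be sketched with two printing functions: one that writes every pair with an explicit dot, and one that applies the simplification rule. The functions atomToString, dotted, abbreviated and tail are my own names for this illustration and are not part of the parser you are given. I also assume $ is declared right-associative with infixr 5, which is what the printed values in these notes suggest.

```sml
datatype sexp =
    Atom of atom
  | $ of (sexp * sexp)
and atom =
    Nil
  | Symbol of string
  | Number of number
and number =
    Int of int
  | Real of real

infixr 5 $   (* assumed fixity: a $ b $ c means a $ (b $ c) *)

fun atomToString Nil = "()"
  | atomToString (Symbol s) = s
  | atomToString (Number (Int i)) = Int.toString i
  | atomToString (Number (Real r)) = Real.toString r

(* Every pair is written out in full as (left . right). *)
fun dotted (Atom a) = atomToString a
  | dotted (l $ r) = "(" ^ dotted l ^ " . " ^ dotted r ^ ")"

(* The same value with the rule applied: a dot followed by a
   parenthesized s-expression loses the dot and that pair of parens. *)
fun abbreviated (Atom a) = atomToString a
  | abbreviated (l $ r) = "(" ^ abbreviated l ^ tail r ^ ")"
and tail (Atom Nil) = ""                        (* ". ()" disappears *)
  | tail (Atom a) = " . " ^ atomToString a      (* dot before an atom stays *)
  | tail (l $ r) = " " ^ abbreviated l ^ tail r

val test = Atom (Symbol "=") $ Atom (Symbol "n") $ Atom (Number (Int 0)) $ Atom Nil
val full = dotted test          (* "(= . (n . (0 . ())))" *)
val short = abbreviated test    (* "(= n 0)" *)
```

A pair whose right side is a non-nil atom, such as Atom (Symbol "a") $ Atom (Symbol "b"), keeps its dot and prints as (a . b) under both functions.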

To parse an input string, you call the function SexpParser.parse_string_for_sexp that is loaded as part of the new binary sml-sexp. For example:

- SexpParser.parse_string_for_sexp "(* 3 4)";
val it =
   Atom (Symbol "*") $ Atom (Number (Int 3)) $
   Atom (Number (Int 4)) $ Atom Nil
: SexpParser.sexp

While we could send this value directly to the interpreter, we have already discussed the advantage of a more abstract syntax. What is this expression really? It is an operator, "*", applied to two operands, "3" and "4". It is your job, as the first step of each phase of the project, to write a function that translates programs in the intermediate syntax to abstract syntax. For the full language, as constructed in phase 3 of the project, the abstract syntax of "calculator expressions" is given by the types:

    datatype vname = Vname of string
    datatype operator = Operator of string
 
    datatype cexpression =
        Cdata of number
      | Cvar of vname
      | Cappl of operator * cexpression list

So, for example:

- SexpParser.parse_string_for_sexp "(* 3 4)";
val it =
   Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil
  : SexpParser.sexp
 
- SyntaxConverter.convert_expression it;
val it = Cappl (Operator "*",[Cdata (Int 3),Cdata (Int 4)])
  : AbstractSyntax.cexpression
 
 
- SexpParser.parse_string_for_sexp "(+ (* 3 4) 2)";
val it =
  Atom (Symbol "+") $
  (Atom (Symbol "*") $
   Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil) $
  Atom (Number (Int 2)) $ Atom Nil : SexpParser.sexp
 
- SyntaxConverter.convert_expression it;
val it =
  Cappl
    (Operator "+",
     [Cappl (Operator "*",[Cdata (Int 3),Cdata (Int 4)]),
      Cdata (Int 2)])
  : AbstractSyntax.cexpression
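One possible shape for such a conversion function is sketched below. This is not the SyntaxConverter used in the transcripts above, only a guess at what a simple version might look like. The datatypes are repeated so the sketch stands alone; the exception SyntaxError, the helper convert_arguments, and the choice of which malformed inputs to reject are all assumptions of mine.

```sml
datatype number = Int of int | Real of real
datatype atom = Nil | Symbol of string | Number of number
datatype sexp = Atom of atom | $ of (sexp * sexp)
infixr 5 $   (* assumed fixity *)

datatype vname = Vname of string
datatype operator = Operator of string
datatype cexpression =
    Cdata of number
  | Cvar of vname
  | Cappl of operator * cexpression list

exception SyntaxError of string   (* hypothetical error signal *)

(* A number is data, a lone symbol is a variable, and a list whose
   head is a symbol is that operator applied to the converted tail. *)
fun convert_expression (Atom (Number n)) = Cdata n
  | convert_expression (Atom (Symbol s)) = Cvar (Vname s)
  | convert_expression (Atom Nil) = raise SyntaxError "empty expression"
  | convert_expression (Atom (Symbol f) $ args) =
      Cappl (Operator f, convert_arguments args)
  | convert_expression (_ $ _) =
      raise SyntaxError "operator position must hold a symbol"

(* Walk the chain of $ pairs down to the closing nil. *)
and convert_arguments (Atom Nil) = []
  | convert_arguments (arg $ rest) =
      convert_expression arg :: convert_arguments rest
  | convert_arguments (Atom _) =
      raise SyntaxError "argument list must end in nil"

(* (+ (* 3 4) 2) *)
val sum =
  convert_expression
    (Atom (Symbol "+") $
     (Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil) $
     Atom (Number (Int 2)) $ Atom Nil)
```

The two functions are mutually recursive because an argument may itself be an application, as the nested product in the example shows.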

Similarly, there is an abstract syntax data type for the commands of the language:

 
    datatype ccommand =
        Cexpression of cexpression
      | Cdefine of vname * cexpression
      | Cdefun of operator * vname list * cexpression
      | Cbindings
      | Cexit

- SexpParser.parse_string_for_sexp "(define x (+ (* 3 4) 2))";
val it =
  Atom (Symbol "define") $
  Atom (Symbol "x") $
  (Atom (Symbol "+") $
   (Atom (Symbol "*") $
    Atom (Number (Int 3)) $ Atom (Number (Int 4)) $ Atom Nil) $
   Atom (Number (Int 2)) $ Atom Nil) $ Atom Nil : SexpParser.sexp
 
- SyntaxConverter.convert_command it;
val it =
  Cdefine
    (Vname "x",
     Cappl
       (Operator "+",
        [Cappl (Operator "*",[Cdata (Int 3),Cdata (Int 4)]),
         Cdata (Int 2)]))
  : AbstractSyntax.ccommand
 

- SexpParser.parse_string_for_sexp "(defun (f x y) (+ (* 3 x) y))";
val it =
  Atom (Symbol "defun") $
  (Atom (Symbol "f") $ Atom (Symbol "x") $ Atom (Symbol "y") $ Atom Nil) $
  (Atom (Symbol "+") $
   (Atom (Symbol "*") $ Atom (Number (Int 3)) $ Atom (Symbol "x") $ Atom Nil) $
   Atom (Symbol "y") $ Atom Nil) $ Atom Nil : SexpParser.sexp
 
- SyntaxConverter.convert_command it;
val it =
  Cdefun
    (Operator "f",[Vname "x",Vname "y"],
     Cappl
       (Operator "+",
        [Cappl (Operator "*",[Cdata (Int 3),
                              Cvar (Vname "x")]),
         Cvar (Vname "y")])) : AbstractSyntax.ccommand
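The command conversions shown in these transcripts might be sketched along the following lines. Again, this is a guess and not the actual SyntaxConverter. The notes do not show the concrete form of the bindings and exit commands, so the sketch assumes they are written (bindings) and (exit). The exception SyntaxError and the helpers convert_arguments and names are hypothetical, and a compact expression converter is included only so that the block stands alone.

```sml
datatype number = Int of int | Real of real
datatype atom = Nil | Symbol of string | Number of number
datatype sexp = Atom of atom | $ of (sexp * sexp)
infixr 5 $   (* assumed fixity *)

datatype vname = Vname of string
datatype operator = Operator of string
datatype cexpression =
    Cdata of number
  | Cvar of vname
  | Cappl of operator * cexpression list
datatype ccommand =
    Cexpression of cexpression
  | Cdefine of vname * cexpression
  | Cdefun of operator * vname list * cexpression
  | Cbindings
  | Cexit

exception SyntaxError of string   (* hypothetical error signal *)

fun convert_expression (Atom (Number n)) = Cdata n
  | convert_expression (Atom (Symbol s)) = Cvar (Vname s)
  | convert_expression (Atom (Symbol f) $ args) =
      Cappl (Operator f, convert_arguments args)
  | convert_expression _ = raise SyntaxError "malformed expression"
and convert_arguments (Atom Nil) = []
  | convert_arguments (arg $ rest) =
      convert_expression arg :: convert_arguments rest
  | convert_arguments _ = raise SyntaxError "malformed argument list"

(* The formal parameters of a defun must all be symbols. *)
fun names (Atom Nil) = []
  | names (Atom (Symbol s) $ rest) = Vname s :: names rest
  | names _ = raise SyntaxError "parameters must be symbols"

(* Recognize the special forms first; anything else is an expression.
   A malformed special form, such as (define x), falls through to the
   last clause in this sketch rather than being reported. *)
fun convert_command (Atom (Symbol "define") $ Atom (Symbol x) $ body $ Atom Nil) =
      Cdefine (Vname x, convert_expression body)
  | convert_command (Atom (Symbol "defun") $ (Atom (Symbol f) $ params)
                     $ body $ Atom Nil) =
      Cdefun (Operator f, names params, convert_expression body)
  | convert_command (Atom (Symbol "bindings") $ Atom Nil) = Cbindings  (* assumed form *)
  | convert_command (Atom (Symbol "exit") $ Atom Nil) = Cexit          (* assumed form *)
  | convert_command other = Cexpression (convert_expression other)

(* (define x 2) *)
val def =
  convert_command
    (Atom (Symbol "define") $ Atom (Symbol "x") $ Atom (Number (Int 2)) $ Atom Nil)
```

Because the clauses are tried in order, the special forms take priority over the general rule that a list headed by a symbol is an application.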

	This page copyright ©2000 by Joshua S. Hodas. It was built on a Macintosh. Last rebuilt on Wednesday, February 21, 2000 at 9:00 AM.
http://www.cs.hmc.edu/~hodas/courses/cs131/lectures/lecture10.html
