Project 08

An Interpreter for a Lexically-Scoped Mini-Scheme

Due by 5:00 PM on Friday December 3, 1999

 

Purpose:

  In this project you will extend and modify the dynamically-scoped Lisp built in the last project so that it becomes a lexically scoped Scheme interpreter with first-class functions. This will involve adding storage structures for functions as data and modifying the way in which s-expressions are stored. Most importantly, it will involve implementing types for environments and closures that behave in the desired way.

Background:

 
Lexical Scoping A major difference between this interpreter and the last is its use of lexical scope. This requires a deeper interleaving of environments (i.e. lists of identifier/value binding pairs) into other structures and routines. For example, it should now be possible to catch all references to undeclared identifiers at "compile" (i.e., syntax-conversion) time. But that means that the routines in SyntaxConverter need access to the current environment. Whenever an identifier is encountered, the environment needs to be checked to see if the identifier is defined. (And, when we are converting the body of a function, we must treat the formal parameters as having definitions. This requires augmenting the current environment with dummy bindings for those variables when descending into a function.)

Don't Panic!!! You don't have to write this yourself. I'm giving you a functioning SyntaxConverter to use.

Now, as we have discussed in class, in lexically-scoped languages, when you store a function definition you must include a pointer to the top of the environment active at the point of definition. When the function is applied to some arguments, the bindings of the formal parameters are added to this stored environment (rather than the current environment) to produce the environment in which the function's body is evaluated. If this were the only change we were making to the interpreter then it would simply involve adding a slot for an environment to the structure used to hold function definitions in the function definition list. Of course the stored environment would need to list both the variable and the function bindings.

First-Class Functions However, the other task for this project is to make functions first-class data objects, which will require additional, related changes. In one respect it will simplify the interpreter: there will no longer be separate environments (name spaces) for functions and variables. Instead there will only be variables, some of which will have functions as their values.

Unnamed Functions There will actually be two kinds of functions in this system: named functions defined in the traditional way with defun, and unnamed functions entered with the expression syntax:
(lambda (var_1 ... var_n) body) 
Of course, it is possible to bind a variable name to an unnamed function by writing:
(define name (lambda (var_1 ... var_n) body)) 
but, as we will see below, this sort of named function will not allow recursion, whereas those defined with defun will. (While both sorts of function definitions will construct the same sort of data value when evaluated, the construction of that value will happen in different ways, so that those resulting from defuns allow recursion.)

In order to support the lambda syntax, a new case has been added to the type AbstractSyntax.cexpression:

| Clambda of vname list * cexpression 

When you encounter a lambda expression (or, more precisely, a Clambda structure) during evaluation, as described in class, the result is a closure, which includes the definition of the function (i.e. the same information as the Clambda) as well as a pointer to the current environment, which will be used if and when this function is ever applied. In order to support this we need a datavalue type that will allow us to store such structures. Therefore, in this project, rather than just being ConcreteSyntax.sexp the datavalue type will be a new multi-case data type. The full type definition will be given below.

Now, if we were not worried about recursion, then we would let SML handle the details of creating the pointer to the current environment for us, by just building an ordinary data-constructor case for datavalue with the environment as one of its fields, as in:

    | Cclosure of vname list * cexpression   (* Formal params & body *)
                * (vname * datavalue) list   (* Current environment *)
The first two fields would store the information from the Clambda being evaluated, and the third would hold the current environment at the time the definition is evaluated.

Named Functions However, you will recall from class that in order to support recursion, the closure constructed by a defun must hold a slightly different environment. Instead of just storing the current environment in the closure, we must first add the binding of the function name to the closure to the current environment, then put a pointer to that resulting environment in the closure. That way the function's definition will be available for a recursive call.

This requires being able to build a pointer structure with a loop in it (the environment pointed to by the current environment pointer contains a closure, which in turn contains a pointer to the current environment). Unfortunately (or, perhaps, fortunately, since it avoids a lot of potential errors when such loops are unintended), you can't produce such loops with ordinary SML data structures.

As we saw in the coverage of streams, a conceptual loop can be produced with a thunk (a delayed or frozen object). They can also be created with refs, which, as we have seen, are just explicit pointers. This is the technique we will use in this project. Therefore, the datavalue case for a closure will be:

 
    | Cclosure of vname list * cexpression      (* Formal params & body *)
                * (vname * datavalue) list ref  (* Current environment *)
In the case of evaluating a lambda, you just store a ref to the current environment, and leave it at that. For a closure resulting from evaluating a defun, we must construct the necessary loop.

Tying The Knot While I won't tell you exactly what to do to produce the loop we need, I'll give you an example of the general idea. Suppose we define the following data type:
 datatype foo = bar | baz of int * foo ref
We can construct some simple values of this type as in:
- val a = bar;
val a = bar : foo
- val b = baz (1, ref bar);
val b = baz (1,ref bar) : foo
- val c = baz (1, ref (baz (2, ref bar)));
val c = baz (1,ref (baz (2,ref bar))) : foo
Of course, if this is all we were going to do, there would be no reason to have put the ref in the pair (we could just have used baz of int * foo). But suppose we define the following values:
- val x = ref bar;
val x = ref bar : foo ref
- val y = baz (1, ref (baz (2, x)));
val y = baz (1,ref (baz (2,ref bar))) : foo
While the value of y looks the same as the value of c, the difference is that the bar pointed to by the innermost ref is not just any bar, it is the one in the cell pointed to by the ref in x. So, if we change the value that is at the end of the ref in x, it will change the one in y. For example, we could make the following change:
- x := baz (3, ref bar);
val it = () : unit
- y;
val it = baz (1,ref (baz (2,ref (baz (3,ref bar))))) : foo
The trick to creating the loop is to set the ref in x to point to y, thus creating a loop in y:
- x:= y;
val it = () : unit
- y;
val it = baz (1,ref (baz (2,ref (baz (1,%1)))) as %1) : foo
The bit with the percent sign is SML's (rather smart) way of avoiding going into a loop printing the value of y.

Paulson, the author of ML for the Working Programmer, refers to this process as tying the knot.

Thus, when you build the Cclosure for a defun, you first put a dummy environment at the end of the ref in the closure, then eventually replace that value with the overall environment that includes the binding of the function name to the closure, tying the knot.
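The two construction paths can be sketched as follows. This is a sketch only: the function names eval_lambda and eval_defun, and the exact shape of the environment, are illustrative assumptions, not the project's actual code.

```sml
(* Sketch: building closures for lambda vs. defun (names assumed). *)

(* lambda: just capture the current environment as-is. *)
fun eval_lambda (formals, body, env) =
    Cclosure (formals, body, ref env)

(* defun: tie the knot, so that the closure's stored environment
   includes a binding of the function's own name to the closure. *)
fun eval_defun (fname, formals, body, env) =
    let val envref  = ref env                      (* dummy for now *)
        val closure = Cclosure (formals, body, envref)
    in  envref := (fname, closure) :: env;         (* tie the knot  *)
        closure
    end
```

Note that the intermediate contents of the ref never escape: by the time the closure can be applied, the knot has already been tied.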

Primitive Operators Actually, I lied. There aren't two kinds of functions in this system, there are three. The last kind is used for the built-in operations of the language. To make functions fully first-class, the built-ins need to be first class too. (Well, almost first class: there will not be any way to create a new one within the language.) The built-in functions are now referred to as primitive operators, or prim-ops for short. There is no syntax for prim-ops since, as we said, you can't create new ones. As data values they will be stored using the following case of the new datavalue type:
| Cprimop of operator * int 
where the integer is the arity of the operator.

Data Values We have alluded to the new type for data values several times already. When we first started this project many weeks ago, the data values were numbers. Then we graduated to numbers and strings, and then to s-expressions. The important thing all these types had in common was that they were subtypes of the types already being used in the concrete syntax of the language. That is, they were all part of the underlying syntax. A number as typed by the user (or at least as returned by the parser) is the same as a number as stored in a variable.

Now, however, we have functions as data, and their evaluated values are different from their syntactic representation (a closure, as opposed to the lambda expression that produced it). So our existing types are inadequate. Of course, we still need to be able to build pairs (i.e. s-expressions), but at the leaves of the s-expressions (i.e. where the atoms used to be) are values that might be numbers, strings, or symbols (i.e. anything in ConcreteSyntax.atom), or one of the two kinds of functions (a closure or prim-op).

So, the new type definition for AbstractSyntax.datavalue mimics the type definition for ConcreteSyntax.sexp:

datatype datavalue =
    Catom of atom
  | $$ of datavalue * datavalue
  | Cprimop of operator * int
  | Cclosure of vname list * cexpression
                           * (vname * datavalue) list ref

(Note that the leading C on all our data constructors, which we have inherited from the twice-removed calculator interpreter, now has only historic significance. I have maintained it as I think you are used to seeing them. This is how strange legacy code that no one can explain comes into being.)

Whereas, during syntax translation in the last project, a quoted object just got wrapped in the Cdata constructor, in this project the structure must first be recursively transformed from an sexp to a datavalue with the same skeletal structure. This conversion is trivial, and consists of just replacing the spine of $ constructors with a spine of $$ constructors. This is handled in the new version of SyntaxConverter. The function AbstractSyntax.datavalue_to_string, which was previously set to an appropriate function from ConcreteSyntax, is now defined in full in AbstractSyntax.
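The conversion just described amounts to a two-line structural recursion. In this sketch, the constructor name Satom and the infix declaration are assumptions about how ConcreteSyntax is set up; only the $-to-$$ spine replacement is taken from the text.

```sml
infix $ $$   (* assuming both pair constructors are declared infix *)

(* Copy the skeleton, replacing each $ node with a $$ node. *)
fun sexp_to_datavalue (Satom a) = Catom a
  | sexp_to_datavalue (s1 $ s2) =
        sexp_to_datavalue s1 $$ sexp_to_datavalue s2
```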

A More General Form Of Application The last major effect of allowing functions to be first-class values is that, since they may now be returned as the results of evaluation, the term in head-position in an application no longer needs to be a named function (built-in or user-defined). Instead it may be an arbitrary expression. Therefore, the case for applications in AbstractSyntax.cexpression becomes:
| Cappl of cexpression * cexpression list 

Evaluating an application is much like before, except that instead of just looking up the definition of the named function in the function list (assuming it wasn't a built-in), you take the term in head-position and evaluate it like any other expression. The result of that evaluation is then applied to the evaluated arguments.

Of course, that means the term in the head can't really be any arbitrary expression. It has to evaluate down to one of the two kinds of "functions": a prim-op, or a closure. If not, a runtime error occurs.

To reiterate, application is now a three-step process:

  1. Evaluate the expression in head position.
  2. Evaluate the expressions in argument positions.
  3. Assuming the head-position expression evaluated to either a closure or a prim-op, apply the head-position result to the argument position results.
The last step is best handled by defining an auxiliary apply function that takes the head-position result and the list of argument-position results, pattern matches on the kind of datavalue in the head-position result, and performs the appropriate application (or raises an exception) depending on that kind.
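Such an apply function might be sketched as below. The helpers eval and apply_primop, and the exact exception messages, are assumptions; the ~1 arity convention for arbitrary-arity prim-ops is described under Primitive Operators Redux.

```sml
(* Sketch of the auxiliary apply function (helper names assumed). *)
fun apply (Cprimop (oper, arity)) args =
        if arity = ~1 orelse arity = length args
        then apply_primop oper args
        else raise Runtime_Error "prim-op applied to wrong number of arguments"
  | apply (Cclosure (formals, body, envref)) args =
        (* Bind formals to actuals on top of the *stored* environment. *)
        if length formals = length args
        then eval (body, ListPair.zip (formals, args) @ !envref)
        else raise Runtime_Error "closure applied to wrong number of arguments"
  | apply _ _ =
        raise Runtime_Error "attempt to apply a non-function value"
```

Note that the closure case extends the environment reached through the stored ref, not the environment current at the call site; that is exactly what makes the interpreter lexically scoped.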

Primitive Operators Redux So that you don't have to include all the code for evaluating the primitive operators, which is pretty much unchanged from the old code for built-ins, the project has been restructured around a new functor called PrimOpsFun with the following signature:
signature PRIMOPS =
sig
    type datavalue;
    type operator;
    type vname;
 
    exception Runtime_Error of string;
    exception Internal_Error of string;
 
    val TRUE : datavalue;
    val FALSE : datavalue;
    val isFalse : datavalue -> bool;
 
    val apply_primop : operator -> datavalue list -> datavalue
    val initial_environment : (vname * datavalue) list
end;

The values TRUE, FALSE, and the function isFalse, are provided for convenience to coordinate the return values of the relational operators with the values you are expecting/returning in evaluating conditionals.

The function apply_primop takes the place of the old apply_built-in. Note, however, that it still just takes an operator name, not a full prim-op specification. Since it does not receive the arity of the operator, it is your job to check the arity against the number of actual parameters before you call this function. Prim-ops of arbitrary arity (just list in this case) will have a value of -1 in their arity field.

Note that the runtime arity checking is necessary because it cannot always easily be checked at syntax-conversion time, since you might pass a prim-op as an argument to another function in which it is used. For example, you might write:

(defun (f g x y) (g x y)) 
Then you could write:
(defun (add x y) (f + x y)) 
as well as:
(defun (bad x y) (f car x y)) 
Without types to help us, there is no easy way to determine at syntax conversion (compile) time that the second definition should be rejected. Instead we will leave it till run-time to be discovered. (Obviously, the same runtime arity-check will need to be done for closure application as well.)

(In fact, while it is still possible at this point to do some arity checking at syntax-conversion time, the syntax converter for this project does not do any. The reason is that the extra-credit portion of this project, described below, involves a change that effectively means there cannot be compile-time arity errors. Rather than produce two separate syntax converters for this project, it was decided to just do one which does not do any arity checking.)

While this is the solution we have chosen for implementing the primitive operators, because it is conceptually simple, a more interesting solution that would make use of SML's higher-order features would be to define the datatype case for a primop as:

  | Cprimop of (datavalue list -> datavalue) * int

That is, we would store an actual ML function for the primop. That function would be one which does the work of computing the value of that primop on the arguments it is given. Then we would do away with the function apply_built-in, since it would be represented piecemeal in the individual primops themselves.
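Under that alternative representation, a built-in like cons would carry its own implementation. This is a sketch of the idea only, not the design the project actually uses:

```sml
(* Sketch: cons as a self-contained prim-op under the alternative
   Cprimop of (datavalue list -> datavalue) * int representation. *)
val consOp =
    Cprimop ((fn [x, y] => x $$ y
               | _      => raise Runtime_Error "cons: wrong arity"),
             2)
```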

The Initial Environment The value initial_environment in the signature is just that: an environment that you should use as the environment when you start up your interpreter. It includes bindings of all the built-in function names to the appropriate prim-op structures. Thus there is no special treatment of built-in names. (Well, almost none. See the next paragraph.) When the user types (+ 2 3), syntax-conversion just treats the "+" as a variable name (as it would any other identifier in that position), and converts it to a Cvar. When you evaluate it, like any other variable you will look it up in the environment, where you will find it bound to the appropriate prim-op, which you will then apply to the evaluated arguments (if there are the right number of them). Of course, this means that there is no problem redefining the primitive operators. They are just like any other variables.

There is one exception to the no-special-treatment-for-built-ins rule that you need to consider. If you look at the definition of initial_environment, you will see that it includes bindings (to prim-ops) for car/cdr abbreviations up to those abbreviating three-operation sequences. This is, in fact, all that many real Scheme implementations do, and it is a reasonable compromise. However, I have written the syntax converter so that all such abbreviations are accepted and converted to variables. In order to properly handle them, you should modify the process of looking up a variable in an environment so that if the identifier is not found, and if the identifier looks like a car/cdr abbreviation, then you should construct and return an appropriate prim-op value on the fly. All other undefined variables should cause an Internal_Error to be raised, since they should have been caught during syntax-conversion.
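The fallback lookup might take the following shape. The helpers looks_like_cxr and cxr_operator are hypothetical names for the abbreviation test and the prim-op construction; only the overall behavior is specified above.

```sml
(* Sketch of variable lookup with the car/cdr-abbreviation fallback. *)
fun lookup (env, name) =
    case List.find (fn (n, _) => n = name) env of
        SOME (_, v) => v
      | NONE =>
            if looks_like_cxr name                (* e.g. "cadddr" *)
            then Cprimop (cxr_operator name, 1)   (* built on the fly *)
            else raise Internal_Error ("undefined variable: " ^ name)
```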

By the way, in Scheme it is standard to name predicate (truth-valued) functions with a trailing question mark. (Some versions of Lisp used the name nullp for the function null to make it clear that it was a predicate and not the nil value, but this was pretty ugly.) Therefore, the built-ins null and atom from the last project are now bound to the names null? and atom?. In addition, three more type-testing predicates have been defined: symbol?, string? and number?. These names conform to the standard ones used in real Scheme implementations.

Your Job (Whether Or Not You Choose To Accept It):

 

After all that, there's not much to actually say here. You are to submit new versions of InterpreterFun and EvaluatorFun. Submit them in a single file; they should fall well below the submission-length limit. (Because of the new PrimOpsFun module, you will be stripping about half the code out of the evaluator functor.)

Most of the changes to InterpreterFun are trivial: mostly they involve removing all mention of the function definition list, since we don't need it anymore. The only interesting change (which is the reason I left this to you) is the change to the handling of Cdefun commands. (Oh, and don't forget to account for the exceptions that apply_primop might raise.)

The definition of EvaluatorFun will need pretty extensive changes all having to do with properly handling the various kinds of first-class functions.

How About A Little Extra Credit?

 

Consider the following definitions:

(defun (add x y) (+ x y))
(defun (make_adder n) (add n))
As the system is specified at this point, if make_adder were ever applied to an argument, it would generate a runtime error stating that add was applied to too few arguments.

It would be nice, however, to recognize that we really wish to build a partial application, and, at that point, curry the definition of add as much as necessary to build the partial application, and then return a new closure for the remainder. (I.e., a call to make_adder would return a closure of one argument.)

Similarly, once we support this change, it would be reasonable to assume that application associates to the left as in ML. Thus the interpreter should view the following as equivalent:

(define seven ((make_adder 3) 4))
(define seven (make_adder 3 4))
Thus it should be syntactically acceptable to apply a function to more arguments than it is defined to take, under the assumption that after using its defined number of arguments the function will return a function which will then be applied to the rest. This could, of course, lead to a runtime error if the function does not return some sort of function.

Of course, there is no reason to restrict this treatment to user defined functions. It is just as reasonable to define make_adder as:

(defun (make_adder n) (+ n)) 
without the intermediate definition of add.

For extra-credit, rewrite the evaluator so that this automated currying/uncurrying occurs.

Note that, in order to make the syntax well-behaved, it is necessary to assume that certain "misuses" of a function are still wrong. In particular, applying a function of non-zero-arity to no arguments is wrong (since it does not accomplish anything), and applying a zero-arity function to arguments is also bad (since it makes the syntax for function application somewhat ambiguous). Both of these should generate runtime errors. (Recall, this is why ML does not have zero-arity functions, and instead has unit; it avoids this ambiguity.)


This page copyright ©1998 by Joshua S. Hodas. It was built on a Macintosh. Last rebuilt on Fri, Apr 10, 1999 at 3:21:15 PM.
http://cs.hmc.edu/~hodas/courses/cs131/projects/project08.html