Inductive Definitions
Languages,
Grammars,
Parsing
Motivation: Parsing
Parsing is the act of turning text into meaningful information.
Example:
Programming language: Parsing makes the language into an executable machine language program.
Calculator: Parsing interprets the symbols to carry out the calculation being represented.
345 + 62.7*84.9
doesnÕt have a ÒmagicalÓ meaning; we have to give it one.

Grammars, and Induction
Grammars provide a plan for parsing;
they define the syntax of a language.
Grammars are an instance of a more general concept: Inductive Definitions.
rex rules are often inductive definitions; but grammars may be non-deterministic for a reason.

Inductive Definitions
Inductive definitions are the main ÒconstructiveÓ way to define infinite sets.
We will need infinite sets in much of what follows.

Inductive Definitions
Elements of an inductive definition of a set S.
Basis
Induction rule(s)
Extremal clause

Inductive Definitions
Elements of an inductive definition of a set S:
Basis: Defines a few items to be in S.
Induction rule(s): Introduce new items in S based on existence of other, usually simpler, items.
Extremal clause: Says that the only items in S are those derivable by the previous two elements, applied any finite numer of times.

Example of ID: Binary Trees
   is a binary tree.
If T1 and T2 are binary trees, then so is:
Extremal clause: The only binary trees are those constructible by a finite number of applications of the above rules.

Examples of Binary Trees
Example of ID: Natural Numbers w
Basis: 0 is in w.
Induction: If n is in w, so is the successor of n (variously denoted nÕ, S(n), or n+1).
Extremal: The only elements in w are those derivable by applications of the above rules.
Examples: 0, 0Õ, 0ÕÕ, 0ÕÕÕ, 0ÕÕÕÕ, É are all elements of w.

Notes
w is an infinite set.
wÕs members are all finite.
w does not contain infinity (´) as an element.

Interpretations of Successor (Õ)
What are 0Õ, 0ÕÕ, 0ÕÕÕ, É really?
Strings of symbols, or
Things that can be constructed from sets, a more primitive concept.
Two variations:
0 is { }, the empty set; XÕ is the set {X}, or
0 is { }, the empty set; XÕ is the set X é {X}.
In the second example: 0 is {}, 0Õ is {{}},
0ÕÕ is {{}, {{}}}, 0ÕÕÕ is {{}, {{}}, {{}, {{}}}}, É
Advantage: 0n Õs is a set with n distinct members.

Decimal Numerals
We can agree by convention that
1 stands for 0Õ,
2 stands for 0ÕÕ,
É
9 stands for 0ÕÕÕÕÕÕÕÕÕ.
Beyond that, give an algorithm for generating additional numerals:
10, 11, 12, 13, ...

Decimal Numbering Rule
The successor of x0 (concatenation) is x1, the successor of x1 is x2, É, and the successor of x8 is x9.
The successor of x9 is y0 where y is the successor of x.
Example: 0, 1, É, 9, 10, 11, É, 19, 20, 21, É, 99, 100, ...

1-adic Numerals
The only digit is 1.
The empty string (denoted l so it is readable) stands for 0.
1X (1 followed by X) stands for XÕ.
The numerals are:
l, 1, 11, 111, 1111, 1111, É
Could also use lists: [ ], [1], [1, 1], [1, 1, 1], ...

2-adic Numerals
The digits are 1 and 2.
The empty string (denoted l so it is visible) stands for 0.
The numerals are:
l, 1, 2, 11, 12, 21, 22, 111, 112, É
Unlike binary numerals, there is no redundancy (1, 01, 001, 0001, É all mean the same thing in binary).

Roman Numerals
The digits are I, V, X, L, C, D, M.
There is no string for 0.
The successor of I is s(I) = II, s(II) = III, s(III) = IV, etc.

Numerals vs. Numbers
Numbers are abstract.
Numerals are a concrete representation of numbers.

Strings over an alphabet S
The set of all finite strings over an alphabet S is denoted S*.
Example:
{a, b}* =
{
l, a, b, aa, ab, ba, bb, aaa, aab, aba, É}

Strings over an alphabet S
Basis: l, the empty string, is in S*.
Inductive rule: If x ë S* and s ë S, then s x (s followed by x) is in S*.
Extremal clause.

Languages
A language over S is any subset of S*, the set of all strings over S.
Examples, where S = {a, b}:
{a, b}* itself
{ }, the empty language
{ba, baba}, maybe your first language
{l, aa, aaaa, aaaa, aaaaaa, É} the language of an even number of aÕs.

More Languages
More Examples, where S = {a, b}
{l, ab, ba, aabb, abab, baab, bbaa, aaabbb, aababb, É} the language in which the number of aÕs equals the number of bÕs.
{a, b, aa, bb, aab, aba, baa, abb, bab, bba, É} the language in which the number of aÕs is not equal to the number of bÕs.
{ab, abab, aabb, aababb, É} a language you might recognize.

Languages
There are lots of languages, some very weird.
To be of computational interest, a language needs to be defined inductively.
We need a way of telling whether a given string is in the language or not
(called parsing the string).

Non-Trivial Language
Defined Inductively
L = {ab, abab, aabb, aababb, É}
Basis: ab is in L.
Inductive rules:
If x is in L, so is axb.
If x1 and x2 are in L, so is x1x2.

Grammars: A Shorthand
Spelling everything out with these inductive definitions is laborious.
We need a shorthand, especially for more complex languages.
The idea comes from linguistics  and early work on computer languages.

Grammar Definition
There is a Òstart symbolÓ, or ÒrootÓ, say S, which is not in the alphabet of the language itself.
¨ is a symbol meaning Òcan be rewritten asÓ.
Grammar rules, for example:
S ¨ ab
S ¨ aSb
S ¨ SS
Application of rules is by Òfree choiceÓ.
A sequence of applications is called a derivation.
The strings in the language are those that donÕt include S.

Using the Grammar Rules
Grammar rules:
S ¨ ab
S ¨ aSb
S ¨ SS
Example derivations of strings in the language:
S Þ ab
S Þ aSb Þ aabb
S Þ aSb Þ aaSbb Þ aaabbb
S Þ SS Þ abS Þ abab
S Þ SS Þ SSS Þ ababab
S Þ SS Þ aSbS Þ aabbS Þ aabbaSb Þ aabbaabb

Generalizing Grammar Rules
Instead of just S, allow multiple symbols, called auxiliaries, none of which are in the alphabet of the language.
A distinguished auxiliary is called the root or Òstart symbolÓ.
The symbols in the alphabet of the language are called terminals.
The rules are known as productions.

Example:
Grammar for Additive Arithmetic Expressions
The root is A.
The terminals are {a, b, c, +}.
The productions are:
A ¨ V
A ¨ V + A
V ¨ a
V ¨ b
V ¨ c

Example Derivations
The productions are:
A ¨ V
A ¨ V + A
Sample derivations:
A Þ V Þ a
A Þ V Þ c
A Þ V + A Þ c + A Þ c + V Þ c + a
A Þ V + A Þ c + A Þ c + V + A Þ c + b + A Þ c + b + V Þ c + b + a

Shorthands on top of Shorthands
The productions are:
A ¨ V
A ¨ V + A
Group by common left-hand sides
Use | (read ÒorÓ) to represent alternatives:
A ¨ V  |  V + A
V ¨ a  |  b  |  c
Note: | Òbinds more looselyÓ than other symbols.
Same grammar, just a briefer notation.

Derivation Tree Visualization
Syntax Tree (!= Derivation Tree)
Shows Implied ÒInterpretation" of String
Right Grouping (used so far) vs.
Left Grouping Productions
Does Grouping Matter?
Mathematically, + is an associative operator:
(a + b) + c == a + (b + c)
However:
There are non-associative operators, such as -, where it does matter.
(a - b) - c  ?=  a - (b - c)
Also, on computers, for floating point addition, associativity does not always hold.

Floating Point is Not Associative
Try this:
sumup(m, n)   = m > n ? 0 : 1./m + sumup(m+1, n);
sumdown(m, n) = m > n ? 0 : 1./n + sumdown(m, n-1);
test(n) = sumup(1, n) == sumdown(1, n);
map(test, range(1, 100));
[1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,
 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1,
 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,
 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
Grouping sensitivity is due to round-off error.

Precedence Issue
(multiple operator symbols)
How do we ensure that the syntax tree of
a + b*c
looks like this: and not this:

Multiple Auxiliaries
We want * to Ò bind more tightlyÓ than +.
Use a different auxiliary symbol for each level of precedence.
Arrange it so that expansions from the tighter binding auxiliary symbol can only be done after those of the looser binding auxiliary.

Precedence Issue
We must ensure that the derivation tree for a+b*c looks like this: and not this:

Example:
Grammar for Additive & Multiplicative
 Arithmetic Expressions
The root is A.
The terminals are {a, b, c, +, *}.
The productions are:
A ¨ M + A  |  M
M ¨ V * M  |  V
V ¨ a  |  b  |  c
Intuitive rule: Operator Òfarther from the rootÓ binds more tightly

Syntactic Categories
The various auxiliary symbols typically represent syntactic categories: sets of sub-expressions having a certain type of meaning.
Categories:
A ¨ M + A  |  M S is a ÒsumÓ
M ¨ V * M  |  V P is a ÒproductÓ
V ¨ a  |  b  |  c V is a ÒvariableÓ

Example Derivations
The productions are:
A ¨ M + A  |  M
M ¨ V * M  |  V
V ¨ a  |  b  |  c
Sample derivations (A is the syntactic category):
A Þ M  Þ V Þ a
A Þ M + A  Þ V + A Þ a + A Þ a + M Þ a + V Þ a + b
A Þ M + A  Þ V + A Þ a + A Þ a + M Þ a + V*M
Þ a + b * M Þ a + b * c
A Þ M + A  Þ V*M + A Þ a *M + A Þ a *V + A
Þ a *b + A Þ a *b + M Þ a *b + V Þ a *b + c

Example Syntactic Categories
The productions are:
A ¨ M + A  |  M
M ¨ V * M  |  V
V ¨ a  |  b  |  c
Sample sub-derivations, e.g. from M:
M  Þ V Þ a
M  Þ V*M Þ a*M Þ a*V Þ a*b
M  Þ V*M Þ a*M Þ a*V*M Þ a*b*M Þ a*b*V Þa*b*a
Observation: Derivations from M will never include any +Õs.

Exercise: Include ^ (power)
^ binds the most tightly
* is next
+ is the weakest

How to handle Ô(Õ     Ô)Õ
Parentheses means
Òhandle inside as a single unitÓ
Parallel level to a single variable
Sometimes called ÒprimariesÓ

Two Main Language Problems
Recognition problem:
Is a given string in the language?
Meaning problem:
What is the meaning of a string if
it is in the language?

Na•ve Solution to the
Recognition Problem
To determine whether string x is in the language generated by a grammar:
Start with the start symbol.
Generate strings successively by applying productions.
Eventually either:
The string x is generated, or
The new strings being generated all exceed x in length.
So we can tell whether or not x is ever generated.

Parsing
Parsing seeks to solve both problems:
Recognition
Meaning
In addition, it tries to do recognition much more efficiently than the na•ve solution.

Recursive Descent Parsing
Simplest reasonably general form of parsing.
Works for many, but not all grammars.
Sometimes a grammar can be transformed to enable recursive descent.
Recall that each each auxiliary symbol in the grammar can be identified with a syntactic category, the set of strings that can be generated from that symbol (possibly with the help of other symbols). The meaning will derive from this idea.

Recursive Descent
ItÕs called ÒrecursiveÓ because in general grammar productions can ÒcallÓ themselves or each other.
ItÕs called ÒdescentÓ because parsing starts at the root of a Òderivation treeÓ and proceeds toward the leaves.

Parse Methods
For each auxiliary symbol in the grammar, construct a parse method
Each parse methodÕs responsibility is to recognize the longest string in the corresponding syntactic category in the remainder of the input, from the current point onward:
a + b * c

Example
Consider the grammar with start symbol S:
S ¨ V + S | V
V ¨ a  |  b  |  c
The parse begins by trying to identify the entire input string as being in syntactic category S.
Clearly it must find a V to start.
To find a V, it checks to see whether the next symbol is one of those listed.
Having found a V, it checks to see if the next symbol is +.
If so, it recurses, trying to find another S.
If not it stops.
After the top call to S returns, it checks to see whether there are any spurious remaining characters in the input.
If there are, the input is not accepted.
If not, the input is accepted.

Example: Success
Suppose the input string is Òa + b + cÓ.
Subscripts will indicate the particular instance of the method and the ÒargumentÓ will indicate the unparsed remainder of the input.
The parser calls S1(Òa + b + cÓ).
S1 calls V1(Òa + b + cÓ).
V1 identifies a, returns success and unparsed input Ò+ b + cÓ.
S1 checks for + and finds it; therefore S1 calls S2(Òb + cÓ).
S2 calls V2(Òb + cÓ).
V2 identifies b, returns success and unparsed input Ò+ cÓ.
S2 checks for + and finds it; therefore S2 calls S3(ÒcÓ).
S3 calls V3(ÒcÓ).
V3 identifies c, returns success and unparsed input ÒÓ.
S3 checks for + and does not find it; therefore S3 returns success with ÒÓ.
S2 returns success with ÒÓ.
S1 returns success with ÒÓ. The string is accepted.

Example: Failure
Suppose the input string is Òa b + cÓ.
The parser calls S1(Òa b + cÓ).
S1 calls V1(Òa b + cÓ).
V1 identifies a, returns success and unparsed input Òb + cÓ.
S1 checks for + and does not find it; therefore S1 returns success, with Òb + cÓ.
Since the top-level call to S1 has returned, but there is residual input, the string is not accepted.

A rex version of parsing
Each syntactic category will be a rex function.
There is one argument:
the unparsed input, a list of characters.
There are two results:
success or failure indicator
for success: the Syntax Tree
for failure: FAILURE (some special value, not a syntax tree)
the unparsed input.

A rex version of parsing (1)
Test cases
A rex version of parsing (2)
Operators + and *  
with * having higher precedence
Rules:
A ¨ M + A  |  M
M ¨ V * M  |  V
V ¨ a  |  b  |  c
Note that * is analogous to +.
A is to M and + as
M is to V and *
Therefore the same rule pattern applies to both.

rex parsing for +, * (A)
rex parsing for +, * (M)
Parsing Methods in Java
In the Java version, we will Ònot need toÓ return the unparsed input as a value.
We can side-effect the input stream to achieve a similar result, Òusing upÓ characters as we go.
We can store the input stream in the parse object, rather than pass it as an argument.

Parsing Methods in Java
Additive Grammar
Runnable Examples
V() method
makeString(char c)
isVar()
Recursive A() method
Replacing some Recursion
with Iteration
ÒInverse McCarthy TransformationÓ
 for Grammars with left-grouping
Recursion ¨ Iteration
Works in some cases, not all
Use for convenience and readability

A() method, iterative version
The Additive/Multiplicative Grammar
Remembering Precedence Rules
Tighter-binding operators are introduce further away from the root of the grammar:

Syntax Tree Applet
Example: SimpleCalc
Parses numeric expressions with +, *, ()
Computes the numeric answer
Same grammar as SyntaxTree applet

Slide 76
Grammar for Unicalc
Example
3.5 meters^2 / (watt hour)
Operators
^
/
juxtaposition (implied multiplication)
Units (meter, second, etc.)
Numbers (floating point allowed: 1.23e-45)
Parentheses

Result of Parsing Unicalc
A Unicalc Quantity:
Object with 3 components:
numeric multiplier
numerator
denominator
The parser may perform some ÒalgebraÓ:
^ gets converted to multiplication
/ and juxtaposition use Quantity divide and multiply