Inductive Definitions
Languages,
Grammars,
Parsing

Motivation: Parsing


	Parsing is the act of turning text into meaningful information.
	Example:
		Programming language: Parsing makes the language into an executable machine language program.

		Calculator: Parsing interprets the symbols to carry out the calculation being represented. 345 + 62.7*84.9 doesnÕt have a ÒmagicalÓ meaning; we have to give it one.

Grammars, and Induction


	Grammars provide a plan for parsing; they define the syntax of a language.

	Grammars are an instance of a more general concept: Inductive Definitions.

	rex rules are often inductive definitions; but grammars may be non-deterministic for a reason.

Inductive Definitions


	Inductive definitions are the main ÒconstructiveÓ way to define infinite sets.

	We will need infinite sets in much of what follows.

Inductive Definitions


	Elements of an inductive definition of a set S.
		Basis
		Induction rule(s)
		Extremal clause

Inductive Definitions


	Elements of an inductive definition of a set S:
		Basis: Defines a few items to be in S.

		Induction rule(s): Introduce new items in S based on existence of other, usually simpler, items.

		Extremal clause: Says that the only items in S are those derivable by the previous two elements, applied any finite numer of times.

Example of ID: Binary Trees


	is a binary tree.
	If T₁ and T₂ are binary trees, then so is:

	Extremal clause: The only binary trees are those constructible by a finite number of applications of the above rules.

Examples of Binary Trees

Example of ID: Natural Numbers w


	Basis: 0 is in w.
	Induction: If n is in w, so is the successor of n (variously denoted nÕ, S(n), or n+1).
	Extremal: The only elements in w are those derivable by applications of the above rules.
	Examples: 0, 0Õ, 0ÕÕ, 0ÕÕÕ, 0ÕÕÕÕ, É are all elements of w.

Notes


	w is an infinite set.
	wÕs members are all finite.
	w does not contain infinity (´) as an element.

Interpretations of Successor (Õ)


What are 0Õ, 0ÕÕ, 0ÕÕÕ, É really?
	Strings of symbols, or
	Things that can be constructed from sets, a more primitive concept.
		Two variations:
			0 is { }, the empty set; XÕ is the set {X}, or
			0 is { }, the empty set; XÕ is the set X é {X}.
		In the second example: 0 is {}, 0Õ is {{}}, 0ÕÕ is {{}, {{}}}, 0ÕÕÕ is {{}, {{}}, {{}, {{}}}}, É
		Advantage: 0ⁿÕ^s is a set with n distinct members.

Decimal Numerals


	We can agree by convention that
		1 stands for 0Õ,
		2 stands for 0ÕÕ,
		É
		9 stands for 0ÕÕÕÕÕÕÕÕÕ.
		Beyond that, give an algorithm for generating additional numerals: 10, 11, 12, 13, ...

Decimal Numbering Rule


	The successor of x0 (concatenation) is x1, the successor of x1 is x2, É, and the successor of x8 is x9.
	The successor of x9 is y0 where y is the successor of x.
	Example: 0, 1, É, 9, 10, 11, É, 19, 20, 21, É, 99, 100, ...

1-adic Numerals


	The only digit is 1.
	The empty string (denoted l so it is readable) stands for 0.
	1X (1 followed by X) stands for XÕ.
	The numerals are: l, 1, 11, 111, 1111, 1111, É

	Could also use lists: [ ], [1], [1, 1], [1, 1, 1], ...

2-adic Numerals


	The digits are 1 and 2.
	The empty string (denoted l so it is visible) stands for 0.
	The numerals are: l, 1, 2, 11, 12, 21, 22, 111, 112, É
	Unlike binary numerals, there is no redundancy (1, 01, 001, 0001, É all mean the same thing in binary).

Roman Numerals


	The digits are I, V, X, L, C, D, M.
	There is no string for 0.
	The successor of I is s(I) = II, s(II) = III, s(III) = IV, etc.

Numerals vs. Numbers


	Numbers are abstract.

	Numerals are a concrete representation of numbers.

Strings over an alphabet S


	The set of all finite strings over an alphabet S is denoted S*.
	Example:
		{a, b}* = {l, a, b, aa, ab, ba, bb, aaa, aab, aba, É}

Strings over an alphabet S


	Basis: l, the empty string, is in S*.
	Inductive rule: If x ë S* and s ë S, then s x (s followed by x) is in S*.
	Extremal clause.

Languages


	A language over S is any subset of S*, the set of all strings over S.
	Examples, where S = {a, b}:
		{a, b}* itself
		{ }, the empty language
		{ba, baba}, maybe your first language
		{l, aa, aaaa, aaaa, aaaaaa, É} the language of an even number of aÕs.

More Languages


	More Examples, where S = {a, b}
		{l, ab, ba, aabb, abab, baab, bbaa, aaabbb, aababb, É} the language in which the number of aÕs equals the number of bÕs.
		{a, b, aa, bb, aab, aba, baa, abb, bab, bba, É} the language in which the number of aÕs is not equal to the number of bÕs.
		{ab, abab, aabb, aababb, É} a language you might recognize.

Languages


	There are lots of languages, some very weird.
	To be of computational interest, a language needs to be defined inductively.
	We need a way of telling whether a given string is in the language or not (called parsing the string).

Non-Trivial Language
Defined Inductively


	L = {ab, abab, aabb, aababb, É}
	Basis: ab is in L.
	Inductive rules:
		If x is in L, so is axb.
		If x₁ and x₂ are in L, so is x₁x₂.

Grammars: A Shorthand


	Spelling everything out with these inductive definitions is laborious.
	We need a shorthand, especially for more complex languages.

	The idea comes from linguistics and early work on computer languages.

Grammar Definition


	There is a Òstart symbolÓ, or ÒrootÓ, say S, which is not in the alphabet of the language itself.
	¨ is a symbol meaning Òcan be rewritten asÓ.
	Grammar rules, for example:
		S ¨ ab
		S ¨ aSb
		S ¨ SS
	Application of rules is by Òfree choiceÓ.
	A sequence of applications is called a derivation.
	The strings in the language are those that donÕt include S.

Using the Grammar Rules


	Grammar rules:
		S ¨ ab
		S ¨ aSb
		S ¨ SS
	Example derivations of strings in the language:
		S Þ ab
		S Þ aSb Þ aabb
		S Þ aSb Þ aaSbb Þ aaabbb
		S Þ SS Þ abS Þ abab
		S Þ SS Þ SSS Þ ababab
		S Þ SS Þ aSbS Þ aabbS Þ aabbaSb Þ aabbaabb

Generalizing Grammar Rules


	Instead of just S, allow multiple symbols, called auxiliaries, none of which are in the alphabet of the language.
	A distinguished auxiliary is called the root or Òstart symbolÓ.
	The symbols in the alphabet of the language are called terminals.
	The rules are known as productions.

Example:
Grammar for Additive Arithmetic Expressions


	The root is A.
	The terminals are {a, b, c, +}.
	The productions are:
		A ¨ V
		A ¨ V + A
		V ¨ a
		V ¨ b
		V ¨ c

Example Derivations


	The productions are:
		A ¨ V
		A ¨ V + A

	Sample derivations:
		A Þ V Þ a
		A Þ V Þ c
		A Þ V + A Þ c + A Þ c + V Þ c + a
		A Þ V + A Þ c + A Þ c + V + A Þ c + b + A Þ c + b + V Þ c + b + a

Shorthands on top of Shorthands


	The productions are:
		A ¨ V
		A ¨ V + A

	Group by common left-hand sides
	Use \| (read ÒorÓ) to represent alternatives:
		A ¨ V \| V + A
		V ¨ a \| b \| c
	Note: \| Òbinds more looselyÓ than other symbols.
	Same grammar, just a briefer notation.

Derivation Tree Visualization

Syntax Tree (!= Derivation Tree)
Shows Implied ÒInterpretation" of String

Right Grouping (used so far) vs.
Left Grouping Productions

Does Grouping Matter?


Mathematically, + is an associative operator: (a + b) + c == a + (b + c)
However:
	There are non-associative operators, such as -, where it does matter.
		(a - b) - c ?= a - (b - c)
	Also, on computers, for floating point addition, associativity does not always hold.

Floating Point is Not Associative


	Try this:
		sumup(m, n) = m > n ? 0 : 1./m + sumup(m+1, n);
		sumdown(m, n) = m > n ? 0 : 1./n + sumdown(m, n-1);
		test(n) = sumup(1, n) == sumdown(1, n);

		map(test, range(1, 100));

		[1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

	Grouping sensitivity is due to round-off error.

Precedence Issue
(multiple operator symbols)


	How do we ensure that the syntax tree of a + b*c looks like this: and not this:

Multiple Auxiliaries


	We want * to Ò bind more tightlyÓ than +.
	Use a different auxiliary symbol for each level of precedence.
	Arrange it so that expansions from the tighter binding auxiliary symbol can only be done after those of the looser binding auxiliary.

Precedence Issue


	We must ensure that the derivation tree for a+b*c looks like this: and not this:

Example:
Grammar for Additive & Multiplicative
Arithmetic Expressions


	The root is A.
	The terminals are {a, b, c, +, *}.
	The productions are:
		A ¨ M + A \| M
		M ¨ V * M \| V
		V ¨ a \| b \| c
	Intuitive rule: Operator Òfarther from the rootÓ binds more tightly

Syntactic Categories


	The various auxiliary symbols typically represent syntactic categories: sets of sub-expressions having a certain type of meaning.
	Categories:
		A ¨ M + A \| M S is a ÒsumÓ
		M ¨ V * M \| V P is a ÒproductÓ
		V ¨ a \| b \| c V is a ÒvariableÓ

Example Derivations


	The productions are:
		A ¨ M + A \| M
		M ¨ V * M \| V
		V ¨ a \| b \| c
	Sample derivations (A is the syntactic category):
		A Þ M Þ V Þ a
		A Þ M + A Þ V + A Þ a + A Þ a + M Þ a + V Þ a + b
		A Þ M + A Þ V + A Þ a + A Þ a + M Þ a + VM Þ a + b M Þ a + b * c
		A Þ M + A Þ VM + A Þ a M + A Þ a V + A Þ a b + A Þ a b + M Þ a b + V Þ a *b + c

Example Syntactic Categories


	The productions are:
		A ¨ M + A \| M
		M ¨ V * M \| V
		V ¨ a \| b \| c
	Sample sub-derivations, e.g. from M:
		M Þ V Þ a
		M Þ VM Þ aM Þ aV Þ ab
		M Þ VM Þ aM Þ aVM Þ abM Þ abV Þaba
	Observation: Derivations from M will never include any +Õs.

Exercise: Include ^ (power)


	^ binds the most tightly
	* is next
	+ is the weakest

How to handle Ô(Õ Ô)Õ


	Parentheses means Òhandle inside as a single unitÓ
	Parallel level to a single variable
		Sometimes called ÒprimariesÓ

Two Main Language Problems


	Recognition problem: Is a given string in the language?

	Meaning problem: What is the meaning of a string if it is in the language?

Na•ve Solution to the
Recognition Problem


To determine whether string x is in the language generated by a grammar:
	Start with the start symbol.
	Generate strings successively by applying productions.
	Eventually either:
		The string x is generated, or
		The new strings being generated all exceed x in length.
	So we can tell whether or not x is ever generated.

Parsing


	Parsing seeks to solve both problems:
		Recognition
		Meaning
	In addition, it tries to do recognition much more efficiently than the na•ve solution.

Recursive Descent Parsing


	Simplest reasonably general form of parsing.
	Works for many, but not all grammars.
	Sometimes a grammar can be transformed to enable recursive descent.
	Recall that each each auxiliary symbol in the grammar can be identified with a syntactic category, the set of strings that can be generated from that symbol (possibly with the help of other symbols). The meaning will derive from this idea.

Recursive Descent


	ItÕs called ÒrecursiveÓ because in general grammar productions can ÒcallÓ themselves or each other.
	ItÕs called ÒdescentÓ because parsing starts at the root of a Òderivation treeÓ and proceeds toward the leaves.

Parse Methods


	For each auxiliary symbol in the grammar, construct a parse method
	Each parse methodÕs responsibility is to recognize the longest string in the corresponding syntactic category in the remainder of the input, from the current point onward:
	a + b * c

Example


	Consider the grammar with start symbol S:
		S ¨ V + S \| V
		V ¨ a \| b \| c
	The parse begins by trying to identify the entire input string as being in syntactic category S.
	Clearly it must find a V to start.
		To find a V, it checks to see whether the next symbol is one of those listed.
	Having found a V, it checks to see if the next symbol is +.
		If so, it recurses, trying to find another S.
		If not it stops.
	After the top call to S returns, it checks to see whether there are any spurious remaining characters in the input.
		If there are, the input is not accepted.
		If not, the input is accepted.

Example: Success


	Suppose the input string is Òa + b + cÓ.
	Subscripts will indicate the particular instance of the method and the ÒargumentÓ will indicate the unparsed remainder of the input.
	The parser calls S₁(Òa + b + cÓ).
	S₁ calls V₁(Òa + b + cÓ).
	V₁ identifies a, returns success and unparsed input Ò+ b + cÓ.
	S₁ checks for + and finds it; therefore S₁ calls S₂(Òb + cÓ).
	S₂ calls V₂(Òb + cÓ).
	V₂ identifies b, returns success and unparsed input Ò+ cÓ.
	S₂ checks for + and finds it; therefore S₂ calls S₃(ÒcÓ).
	S₃ calls V₃(ÒcÓ).
	V₃ identifies c, returns success and unparsed input ÒÓ.
	S₃ checks for + and does not find it; therefore S₃ returns success with ÒÓ.
	S₂ returns success with ÒÓ.
	S₁ returns success with ÒÓ. The string is accepted.

Example: Failure


	Suppose the input string is Òa b + cÓ.
	The parser calls S₁(Òa b + cÓ).
	S₁ calls V₁(Òa b + cÓ).
	V₁ identifies a, returns success and unparsed input Òb + cÓ.
	S₁ checks for + and does not find it; therefore S₁ returns success, with Òb + cÓ.
	Since the top-level call to S₁ has returned, but there is residual input, the string is not accepted.

A rex version of parsing


Each syntactic category will be a rex function.
There is one argument:
	the unparsed input, a list of characters.
There are two results:
	success or failure indicator
		for success: the Syntax Tree
		for failure: FAILURE (some special value, not a syntax tree)
	the unparsed input.

A rex version of parsing (1)

Test cases

A rex version of parsing (2)

Operators + and *
with * having higher precedence


	Rules:
		A ¨ M + A \| M
		M ¨ V * M \| V
		V ¨ a \| b \| c
	Note that * is analogous to +.
		A is to M and + as M is to V and *
	Therefore the same rule pattern applies to both.

rex parsing for +, * (A)

rex parsing for +, * (M)

Parsing Methods in Java


	In the Java version, we will Ònot need toÓ return the unparsed input as a value.
	We can side-effect the input stream to achieve a similar result, Òusing upÓ characters as we go.

	We can store the input stream in the parse object, rather than pass it as an argument.

Parsing Methods in Java

Additive Grammar

Runnable Examples

V() method

makeString(char c)

isVar()

Recursive A() method

Replacing some Recursion
with Iteration

ÒInverse McCarthy TransformationÓ
for Grammars with left-grouping


	Recursion ¨ Iteration
	Works in some cases, not all
	Use for convenience and readability

A() method, iterative version

The Additive/Multiplicative Grammar

Remembering Precedence Rules


	Tighter-binding operators are introduce further away from the root of the grammar:

Syntax Tree Applet

Example: SimpleCalc


	Parses numeric expressions with +, *, ()
	Computes the numeric answer
	Same grammar as SyntaxTree applet

Slide 76

Grammar for Unicalc


	Example
		3.5 meters^2 / (watt hour)
	Operators
		^
		/
		juxtaposition (implied multiplication)
	Units (meter, second, etc.)
	Numbers (floating point allowed: 1.23e-45)
	Parentheses

Result of Parsing Unicalc


	A Unicalc Quantity: Object with 3 components:
		numeric multiplier
		numerator
		denominator
	The parser may perform some ÒalgebraÓ:
		^ gets converted to multiplication
		/ and juxtaposition use Quantity divide and multiply


	The various auxiliary symbols typically represent syntactic categories: sets of sub-expressions having a certain type of meaning.
	Categories:
		A ¨ M + A \| M S is a ÒsumÓ
		M ¨ V * M \| V P is a ÒproductÓ
		V ¨ a \| b \| c V is a ÒvariableÓ


	The productions are:
		A ¨ M + A \| M
		M ¨ V * M \| V
		V ¨ a \| b \| c
	Sample derivations (A is the syntactic category):
		A Þ M Þ V Þ a
		A Þ M + A Þ V + A Þ a + A Þ a + M Þ a + V Þ a + b
		A Þ M + A Þ V + A Þ a + A Þ a + M Þ a + VM Þ a + b M Þ a + b * c
		A Þ M + A Þ VM + A Þ a M + A Þ a V + A Þ a b + A Þ a b + M Þ a b + V Þ a *b + c


	The productions are:
		A ¨ M + A \| M
		M ¨ V * M \| V
		V ¨ a \| b \| c
	Sample sub-derivations, e.g. from M:
		M Þ V Þ a
		M Þ VM Þ aM Þ aV Þ ab
		M Þ VM Þ aM Þ aVM Þ abM Þ abV Þaba
	Observation: Derivations from M will never include any +Õs.