| Inductive Definitions Languages, Grammars, Parsing |
| Motivation: Parsing |
| Parsing is the act of turning text into meaningful information. | ||
| Example: | ||
| Programming language: Parsing makes the language into an executable machine language program. | ||
| Calculator: Parsing interprets the
symbols to carry out the calculation being represented. 345 + 62.7*84.9 doesnÕt have a ÒmagicalÓ meaning; we have to give it one. |
||
| Grammars, and Induction |
| Grammars provide a plan for
parsing; they define the syntax of a language. |
|
| Grammars are an instance of a more general concept: Inductive Definitions. | |
| rex rules are often inductive definitions; but grammars may be non-deterministic for a reason. |
| Inductive Definitions |
| Inductive definitions are the main ÒconstructiveÓ way to define infinite sets. | |
| We will need infinite sets in much of what follows. |
| Inductive Definitions |
| Elements of an inductive definition of a set S. | ||
| Basis | ||
| Induction rule(s) | ||
| Extremal clause | ||
| Inductive Definitions |
| Elements of an inductive definition of a set S: | ||
| Basis: Defines a few items to be in S. | ||
| Induction rule(s): Introduce new items in S based on existence of other, usually simpler, items. | ||
| Extremal clause: Says that the only items in S are those derivable by the previous two elements, applied any finite numer of times. | ||
| Example of ID: Binary Trees |
| is a binary tree. | |
| If T1 and T2 are binary trees, then so is: | |
| Extremal clause: The only binary trees are those constructible by a finite number of applications of the above rules. |
| Examples of Binary Trees |
| Example of ID: Natural Numbers w |
| Basis: 0 is in w. | |
| Induction: If n is in w, so is the successor of n (variously denoted nÕ, S(n), or n+1). | |
| Extremal: The only elements in w are those derivable by applications of the above rules. | |
| Examples: 0, 0Õ, 0ÕÕ, 0ÕÕÕ, 0ÕÕÕÕ, É are all elements of w. |
| Notes |
| w is an infinite set. | |
| wÕs members are all finite. | |
| w does not contain infinity (´) as an element. |
| Interpretations of Successor (Õ) |
| What are 0Õ, 0ÕÕ, 0ÕÕÕ, É really? | ||||
| Strings of symbols, or | ||||
| Things that can be constructed from sets, a more primitive concept. | ||||
| Two variations: | ||||
| 0 is { }, the empty set; XÕ is the set {X}, or | ||||
| 0 is { }, the empty set; XÕ is the set X é {X}. | ||||
| In the second example: 0 is {}, 0Õ is
{{}}, 0ÕÕ is {{}, {{}}}, 0ÕÕÕ is {{}, {{}}, {{}, {{}}}}, É |
||||
| Advantage: 0n Õs is a set with n distinct members. | ||||
| Decimal Numerals |
| We can agree by convention that | ||
| 1 stands for 0Õ, | ||
| 2 stands for 0ÕÕ, | ||
| É | ||
| 9 stands for 0ÕÕÕÕÕÕÕÕÕ. | ||
| Beyond that, give an algorithm for
generating additional numerals: 10, 11, 12, 13, ... |
||
| Decimal Numbering Rule |
| The successor of x0 (concatenation) is x1, the successor of x1 is x2, É, and the successor of x8 is x9. | |
| The successor of x9 is y0 where y is the successor of x. | |
| Example: 0, 1, É, 9, 10, 11, É, 19, 20, 21, É, 99, 100, ... |
| 1-adic Numerals |
| The only digit is 1. | |
| The empty string (denoted l so it is readable) stands for 0. | |
| 1X (1 followed by X) stands for XÕ. | |
| The numerals are: l, 1, 11, 111, 1111, 1111, É |
|
| Could also use lists: [ ], [1], [1, 1], [1, 1, 1], ... |
| 2-adic Numerals |
| The digits are 1 and 2. | |
| The empty string (denoted l so it is visible) stands for 0. | |
| The numerals are: l, 1, 2, 11, 12, 21, 22, 111, 112, É |
|
| Unlike binary numerals, there is no redundancy (1, 01, 001, 0001, É all mean the same thing in binary). |
| Roman Numerals |
| The digits are I, V, X, L, C, D, M. | |
| There is no string for 0. | |
| The successor of I is s(I) = II, s(II) = III, s(III) = IV, etc. |
| Numerals vs. Numbers |
| Numbers are abstract. | |
| Numerals are a concrete representation of numbers. |
| Strings over an alphabet S |
| The set of all finite strings over an alphabet S is denoted S*. | ||
| Example: | ||
| {a, b}* = {l, a, b, aa, ab, ba, bb, aaa, aab, aba, É} |
||
| Strings over an alphabet S |
| Basis: l, the empty string, is in S*. | |
| Inductive rule: If x ë S* and s ë S, then s x (s followed by x) is in S*. | |
| Extremal clause. |
| Languages |
| A language over S is any subset of S*, the set of all strings over S. | ||
| Examples, where S = {a, b}: | ||
| {a, b}* itself | ||
| { }, the empty language | ||
| {ba, baba}, maybe your first language | ||
| {l, aa, aaaa, aaaa, aaaaaa, É} the language of an even number of aÕs. | ||
| More Languages |
| More Examples, where S = {a, b} | ||
| {l, ab, ba, aabb, abab, baab, bbaa, aaabbb, aababb, É} the language in which the number of aÕs equals the number of bÕs. | ||
| {a, b, aa, bb, aab, aba, baa, abb, bab, bba, É} the language in which the number of aÕs is not equal to the number of bÕs. | ||
| {ab, abab, aabb, aababb, É} a language you might recognize. | ||
| Languages |
| There are lots of languages, some very weird. | |
| To be of computational interest, a language needs to be defined inductively. | |
| We need a way of telling whether a
given string is in the language or not (called parsing the string). |
| Non-Trivial Language Defined Inductively |
| L = {ab, abab, aabb, aababb, É} | ||
| Basis: ab is in L. | ||
| Inductive rules: | ||
| If x is in L, so is axb. | ||
| If x1 and x2 are in L, so is x1x2. | ||
| Grammars: A Shorthand |
| Spelling everything out with these inductive definitions is laborious. | |
| We need a shorthand, especially for more complex languages. | |
| The idea comes from linguistics and early work on computer languages. |
| Grammar Definition |
| There is a Òstart symbolÓ, or ÒrootÓ, say S, which is not in the alphabet of the language itself. | ||
| ¨ is a symbol meaning Òcan be rewritten asÓ. | ||
| Grammar rules, for example: | ||
| S ¨ ab | ||
| S ¨ aSb | ||
| S ¨ SS | ||
| Application of rules is by Òfree choiceÓ. | ||
| A sequence of applications is called a derivation. | ||
| The strings in the language are those that donÕt include S. | ||
| Using the Grammar Rules |
| Grammar rules: | ||
| S ¨ ab | ||
| S ¨ aSb | ||
| S ¨ SS | ||
| Example derivations of strings in the language: | ||
| S Þ ab | ||
| S Þ aSb Þ aabb | ||
| S Þ aSb Þ aaSbb Þ aaabbb | ||
| S Þ SS Þ abS Þ abab | ||
| S Þ SS Þ SSS Þ ababab | ||
| S Þ SS Þ aSbS Þ aabbS Þ aabbaSb Þ aabbaabb | ||
| Generalizing Grammar Rules |
| Instead of just S, allow multiple symbols, called auxiliaries, none of which are in the alphabet of the language. | |
| A distinguished auxiliary is called the root or Òstart symbolÓ. | |
| The symbols in the alphabet of the language are called terminals. | |
| The rules are known as productions. |
| Example: Grammar for Additive Arithmetic Expressions |
| The root is A. | ||
| The terminals are {a, b, c, +}. | ||
| The productions are: | ||
| A ¨ V | ||
| A ¨ V + A | ||
| V ¨ a | ||
| V ¨ b | ||
| V ¨ c | ||
| Example Derivations |
| The productions are: | ||
| A ¨ V | ||
| A ¨ V + A | ||
| Sample derivations: | ||
| A Þ V Þ a | ||
| A Þ V Þ c | ||
| A Þ V + A Þ c + A Þ c + V Þ c + a | ||
| A Þ V + A Þ c + A Þ c + V + A Þ c + b + A Þ c + b + V Þ c + b + a | ||
| Shorthands on top of Shorthands |
| The productions are: | ||
| A ¨ V | ||
| A ¨ V + A | ||
| Group by common left-hand sides | ||
| Use | (read ÒorÓ) to represent alternatives: | ||
| A ¨ V | V + A | ||
| V ¨ a | b | c | ||
| Note: | Òbinds more looselyÓ than other symbols. | ||
| Same grammar, just a briefer notation. | ||
| Derivation Tree Visualization |
| Syntax Tree (!= Derivation
Tree) Shows Implied ÒInterpretation" of String |
| Right Grouping (used so
far) vs. Left Grouping Productions |
| Does Grouping Matter? |
| Mathematically, + is an associative
operator: (a + b) + c == a + (b + c) |
|||
| However: | |||
| There are non-associative operators, such as -, where it does matter. | |||
| (a - b) - c ?= a - (b - c) | |||
| Also, on computers, for floating point addition, associativity does not always hold. | |||
| Floating Point is Not Associative |
| Try this: | ||
| sumup(m, n) = m > n ? 0 : 1./m + sumup(m+1, n); | ||
| sumdown(m, n) = m > n ? 0 : 1./n + sumdown(m, n-1); | ||
| test(n) = sumup(1, n) == sumdown(1, n); | ||
| map(test, range(1, 100)); | ||
| [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0] |
||
| Grouping sensitivity is due to round-off error. | ||
| Precedence Issue (multiple operator symbols) |
| How do we ensure that the syntax tree
of a + b*c looks like this: and not this: |
| Multiple Auxiliaries |
| We want * to Ò bind more tightlyÓ than +. | |
| Use a different auxiliary symbol for each level of precedence. | |
| Arrange it so that expansions from the tighter binding auxiliary symbol can only be done after those of the looser binding auxiliary. | |
| Precedence Issue |
| We must ensure that the derivation tree for a+b*c looks like this: and not this: |
| Example: Grammar for Additive & Multiplicative Arithmetic Expressions |
| The root is A. | ||
| The terminals are {a, b, c, +, *}. | ||
| The productions are: | ||
| A ¨ M + A | M | ||
| M ¨ V * M | V | ||
| V ¨ a | b | c | ||
| Intuitive rule: Operator Òfarther from the rootÓ binds more tightly | ||
| Syntactic Categories |
| The various auxiliary symbols typically represent syntactic categories: sets of sub-expressions having a certain type of meaning. | ||
| Categories: | ||
| A ¨ M + A | M S is a ÒsumÓ | ||
| M ¨ V * M | V P is a ÒproductÓ | ||
| V ¨ a | b | c V is a ÒvariableÓ | ||
| Example Derivations |
| The productions are: | ||
| A ¨ M + A | M | ||
| M ¨ V * M | V | ||
| V ¨ a | b | c | ||
| Sample derivations (A is the syntactic category): | ||
| A Þ M Þ V Þ a | ||
| A Þ M + A Þ V + A Þ a + A Þ a + M Þ a + V Þ a + b | ||
| A Þ M + A Þ V + A Þ a + A Þ a
+ M Þ a + V*M Þ a + b * M Þ a + b * c |
||
| A Þ M + A Þ V*M + A Þ a *M + A
Þ a *V + A Þ a *b + A Þ a *b + M Þ a *b + V Þ a *b + c |
||
| Example Syntactic Categories |
| The productions are: | ||
| A ¨ M + A | M | ||
| M ¨ V * M | V | ||
| V ¨ a | b | c | ||
| Sample sub-derivations, e.g. from M: | ||
| M Þ V Þ a | ||
| M Þ V*M Þ a*M Þ a*V Þ a*b | ||
| M Þ V*M Þ a*M Þ a*V*M Þ a*b*M Þ a*b*V Þa*b*a | ||
| Observation: Derivations from M will never include any +Õs. | ||
| Exercise: Include ^ (power) |
| ^ binds the most tightly | |
| * is next | |
| + is the weakest |
| How to handle Ô(Õ Ô)Õ |
| Parentheses means Òhandle inside as a single unitÓ |
||
| Parallel level to a single variable | ||
| Sometimes called ÒprimariesÓ | ||
| Two Main Language Problems |
| Recognition problem: Is a given string in the language? |
|
| Meaning problem: What is the meaning of a string if it is in the language? |
| Na•ve Solution to the Recognition Problem |
| To determine whether string x is in the language generated by a grammar: | |||
| Start with the start symbol. | |||
| Generate strings successively by applying productions. | |||
| Eventually either: | |||
| The string x is generated, or | |||
| The new strings being generated all exceed x in length. | |||
| So we can tell whether or not x is ever generated. | |||
| Parsing |
| Parsing seeks to solve both problems: | ||
| Recognition | ||
| Meaning | ||
| In addition, it tries to do recognition much more efficiently than the na•ve solution. | ||
| Recursive Descent Parsing |
| Simplest reasonably general form of parsing. | |
| Works for many, but not all grammars. | |
| Sometimes a grammar can be transformed to enable recursive descent. | |
| Recall that each each auxiliary symbol in the grammar can be identified with a syntactic category, the set of strings that can be generated from that symbol (possibly with the help of other symbols). The meaning will derive from this idea. |
| Recursive Descent |
| ItÕs called ÒrecursiveÓ because in general grammar productions can ÒcallÓ themselves or each other. | |
| ItÕs called ÒdescentÓ because parsing starts at the root of a Òderivation treeÓ and proceeds toward the leaves. |
| Parse Methods |
| For each auxiliary symbol in the grammar, construct a parse method | |
| Each parse methodÕs responsibility is to recognize the longest string in the corresponding syntactic category in the remainder of the input, from the current point onward: | |
| a + b * c |
| Example |
| Consider the grammar with start symbol S: | ||
| S ¨ V + S | V | ||
| V ¨ a | b | c | ||
| The parse begins by trying to identify the entire input string as being in syntactic category S. | ||
| Clearly it must find a V to start. | ||
| To find a V, it checks to see whether the next symbol is one of those listed. | ||
| Having found a V, it checks to see if the next symbol is +. | ||
| If so, it recurses, trying to find another S. | ||
| If not it stops. | ||
| After the top call to S returns, it checks to see whether there are any spurious remaining characters in the input. | ||
| If there are, the input is not accepted. | ||
| If not, the input is accepted. | ||
| Example: Success |
| Suppose the input string is Òa + b + cÓ. | |
| Subscripts will indicate the particular instance of the method and the ÒargumentÓ will indicate the unparsed remainder of the input. | |
| The parser calls S1(Òa + b + cÓ). | |
| S1 calls V1(Òa + b + cÓ). | |
| V1 identifies a, returns success and unparsed input Ò+ b + cÓ. | |
| S1 checks for + and finds it; therefore S1 calls S2(Òb + cÓ). | |
| S2 calls V2(Òb + cÓ). | |
| V2 identifies b, returns success and unparsed input Ò+ cÓ. | |
| S2 checks for + and finds it; therefore S2 calls S3(ÒcÓ). | |
| S3 calls V3(ÒcÓ). | |
| V3 identifies c, returns success and unparsed input ÒÓ. | |
| S3 checks for + and does not find it; therefore S3 returns success with ÒÓ. | |
| S2 returns success with ÒÓ. | |
| S1 returns success with ÒÓ. The string is accepted. |
| Example: Failure |
| Suppose the input string is Òa b + cÓ. | |
| The parser calls S1(Òa b + cÓ). | |
| S1 calls V1(Òa b + cÓ). | |
| V1 identifies a, returns success and unparsed input Òb + cÓ. | |
| S1 checks for + and does not find it; therefore S1 returns success, with Òb + cÓ. | |
| Since the top-level call to S1 has returned, but there is residual input, the string is not accepted. |
| A rex version of parsing |
| Each syntactic category will be a rex function. | |||
| There is one argument: | |||
| the unparsed input, a list of characters. | |||
| There are two results: | |||
| success or failure indicator | |||
| for success: the Syntax Tree | |||
| for failure: FAILURE (some special value, not a syntax tree) | |||
| the unparsed input. | |||
| A rex version of parsing (1) |
| Test cases |
| A rex version of parsing (2) |
| Operators + and * with * having higher precedence |
| Rules: | ||
| A ¨ M + A | M | ||
| M ¨ V * M | V | ||
| V ¨ a | b | c | ||
| Note that * is analogous to +. | ||
| A is to M and + as M is to V and * |
||
| Therefore the same rule pattern applies to both. | ||
| rex parsing for +, * (A) |
| rex parsing for +, * (M) |
| Parsing Methods in Java |
| In the Java version, we will Ònot need toÓ return the unparsed input as a value. | |
| We can side-effect the input stream to achieve a similar result, Òusing upÓ characters as we go. | |
| We can store the input stream in the parse object, rather than pass it as an argument. | |
| Parsing Methods in Java |
| Additive Grammar |
| Runnable Examples |
| V() method |
| makeString(char c) |
| isVar() |
| Recursive A() method |
| Replacing some
Recursion with Iteration |
| ÒInverse McCarthy
TransformationÓ for Grammars with left-grouping |
| Recursion ¨ Iteration | |
| Works in some cases, not all | |
| Use for convenience and readability |
| A() method, iterative version |
| The Additive/Multiplicative Grammar |
| Remembering Precedence Rules |
| Tighter-binding operators are introduce further away from the root of the grammar: |
| Syntax Tree Applet |
| Example: SimpleCalc |
| Parses numeric expressions with +, *, () | |
| Computes the numeric answer | |
| Same grammar as SyntaxTree applet |
| Slide 76 |
| Grammar for Unicalc |
| Example | ||
| 3.5 meters^2 / (watt hour) | ||
| Operators | ||
| ^ | ||
| / | ||
| juxtaposition (implied multiplication) | ||
| Units (meter, second, etc.) | ||
| Numbers (floating point allowed: 1.23e-45) | ||
| Parentheses | ||
| Result of Parsing Unicalc |
| A Unicalc Quantity: Object with 3 components: |
||
| numeric multiplier | ||
| numerator | ||
| denominator | ||
| The parser may perform some ÒalgebraÓ: | ||
| ^ gets converted to multiplication | ||
| / and juxtaposition use Quantity divide and multiply | ||