The CYK Algorithm

We conclude the discussion of context-free languages by answering the question, “How can we tell whether a string is in the language of a context-free grammar?”

Parsing is the process of taking a linear sequence of characters and converting this into structured data, such as a parse tree (or, more commonly in practice, into a simplified version of the parse tree called an Abstract Syntax Tree).

There are a number of clever parsing algorithms commonly used in practice, including Recursive Descent and Shift-Reduce Parsing. Both of these algorithms are very efficient; parsing a string of length $n$ can be done in $O(n)$ time.

Unfortunately, these algorithms only work for some CFGs, not all. Recursive Descent is designed to work if the CFG is “LL($k$)” (for some $k\geq 1$), while Shift-Reduce Parsing works if the CFG is “LR($k$)” or “LALR($k$)” (for some $k\geq 0$).

In many computer applications, this isn’t a problem, because we can often modify the grammar or even the entire language to make these algorithms work! The original designers of Java, for example, chose a syntax for the language—keywords, operators, punctuation, etc.—specifically designed to be LALR(1).

However, we don’t always have the option of changing the language. A linguist trying to model a portion of English using CFGs, for example, doesn’t have the option of changing the syntax of English to make it easier to parse! And since this is a course on Theory of Computability, we still might like to know whether it’s possible to parse strings using arbitrary context-free grammars.

The answer is yes!

The CYK Algorithm

There are several parsing algorithms that work for all CFGs, such as Earley’s Algorithm, though none are terribly efficient (often $O(n^3)$ for an input of length $n$). Here we will briefly describe the CYK Algorithm, probably the simplest algorithm for general-purpose CFG parsing.

(The CYK algorithm was independently invented by many researchers; the name comes from the initials of Cocke, Younger, and Kasami, three of the people who rediscovered and published this algorithm.)

Chomsky Normal Form

Definition

A grammar is said to be in Chomsky Normal Form if all rules are either of the form $A\to a$ (i.e., the nonterminal can be replaced by a single terminal) or of the form $A\to BC$ (i.e., the nonterminal can be replaced by two nonterminals).

Example

The grammar

$$S\to (S)\ |\ SS\ |\ ()$$

is not in Chomsky Normal Form, because the first and third rules don’t have the right shape.

However, the grammar

$$\begin{array}{lcl} S&\to&LT\ |\ SS\ |\ LR\\ T&\to&SR \\ L&\to&( \\ R&\to&) \end{array}$$

is in Chomsky Normal Form (and has the same language, namely non-empty well-balanced strings of parentheses).
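For example, the string $\mathtt{(())}$ can be derived in this second grammar as follows (each step applies one rule):

$$S \Rightarrow LT \Rightarrow (T \Rightarrow (SR \Rightarrow (LRR \Rightarrow ((RR \Rightarrow (()R \Rightarrow (())$$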

The CYK algorithm only works (efficiently) for grammars in Chomsky Normal Form. This is less of a problem than it might seem at first, because there’s an algorithm to convert any CFG into a Chomsky Normal Form grammar with the same language. In fact, if the grammar doesn’t explicitly mention $\epsilon$, then the conversion is relatively easy to do by hand:

Example

Given the grammar

$$S\to (S)\ |\ SS\ |\ ()$$
  1. First, we can replace the rule $S\to ()$ with the rules
     $$\begin{array}{lcl} S&\to&LR\\ L&\to&(\\ R&\to&) \end{array}$$
  2. Second, we can break up the rule $S\to \mathtt{(}S\mathtt{)}$ into two binary rules
     $$\begin{array}{lcl} S&\to&(T\\ T&\to&S) \end{array}$$
     and then replace the parentheses with references to $L$ and $R$ to get
     $$\begin{array}{lcl} S&\to&LT\\ T&\to&SR \end{array}$$
     Putting this all together, we get the equivalent grammar
     $$\begin{array}{lcl} S&\to&LT\ |\ SS\ |\ LR\\ T&\to&SR \\ L&\to&( \\ R&\to&) \end{array}$$
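The two hand conversion steps above can be sketched in code. This is a minimal sketch, assuming (as the text does) that the grammar has no $\epsilon$-rules; it also assumes there are no unit rules like $A\to B$. The names `to_cnf`, `N0`, `B0` are illustrative, not part of any standard library.

```python
# A sketch of the two conversion steps used above:
#   TERM: in rules with 2+ symbols, replace each terminal by a fresh nonterminal,
#   BIN:  split right-hand sides longer than two symbols into binary rules.
# Nonterminals are uppercase strings; terminals are other characters.

def to_cnf(rules):
    """rules: list of (lhs, rhs) pairs, with rhs a tuple of symbols."""
    term_names = {}   # terminal -> fresh nonterminal standing for it
    out = []

    # TERM step: introduce N_a -> a for each terminal a in a long rule.
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            new_rhs = []
            for sym in rhs:
                if sym.isupper():               # already a nonterminal
                    new_rhs.append(sym)
                else:                           # terminal: replace by fresh name
                    name = term_names.setdefault(sym, f"N{len(term_names)}")
                    new_rhs.append(name)
            rhs = tuple(new_rhs)
        out.append((lhs, rhs))
    for terminal, name in term_names.items():
        out.append((name, (terminal,)))

    # BIN step: split A -> X1 X2 ... Xn (n > 2) into A -> X1 A', A' -> X2 ... Xn.
    binary = []
    counter = [0]
    for lhs, rhs in out:
        while len(rhs) > 2:
            fresh = f"B{counter[0]}"
            counter[0] += 1
            binary.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        binary.append((lhs, rhs))
    return binary

# Converting S -> (S) | SS | () from the text
# (N0 and N1 play the roles of L and R, and B0 the role of T):
cnf = to_cnf([("S", ("(", "S", ")")), ("S", ("S", "S")), ("S", ("(", ")"))])
```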

The Idea

Suppose our input is a string

$$a_1a_2\cdots a_n$$

(where $n\geq 1$). The idea is to answer the question, “For every substring of the input $a_i\ldots a_k$, what nonterminals can produce that substring?”

We denote by ${\cal N}(i,k)$ the set of nonterminals that can produce the substring $a_i\ldots a_k$.

More formally, for every $i,k$ such that $1 \leq i \leq k \leq n$,

$${\cal N}(i, k) := \{\ B \in {\cal V}\ |\ B \Rightarrow \cdots \Rightarrow a_ia_{i+1}\cdots a_k\ \}$$

We are primarily interested in calculating the set ${\cal N}(1,n)$, because the start symbol $S$ is in this set iff $S\Rightarrow\cdots\Rightarrow a_1\cdots a_n$, that is, iff our given string is in the language of the grammar.

The Algorithm

We can calculate ${\cal N}(1,n)$ recursively as follows:

$$\begin{array}{lcl} {\cal N}(i, i) & := & \{\ C \in {\cal V}\ |\ C \to a_i\ \}\\ {\cal N}(i, k) & := & \{\ C \in {\cal V}\ |\ C \to AB \ \land \ A \in {\cal N}(i, j) \ \land \ B \in {\cal N}(j+1, k) \text{ for some } i \leq j < k\ \} \end{array}$$

That is,

  • Since our grammar is in Chomsky Normal Form, the only nonterminals that can produce a single-character substring $a_i$ are those with a rule that directly replaces that nonterminal with the character $a_i$.
  • For a nonterminal $C$ to produce a longer string $a_i\cdots a_k$, there has to be a rule $C\to AB$ such that $A$ produces the first part of this string and $B$ produces the rest.
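As a concrete sketch, this recursive definition can be transcribed almost directly into code. The grammar encoding below (two rule tables, `UNARY` and `BINARY`) is an assumption for illustration, using the parentheses grammar from earlier:

```python
# Naive recursive computation of N(i, k) for the grammar
#   S -> LT | SS | LR,  T -> SR,  L -> (,  R -> ).
# Indices are 1-based to match the text.

UNARY = {"(": {"L"}, ")": {"R"}}                      # rules C -> a
BINARY = {("L", "T"): {"S"}, ("S", "S"): {"S"},       # rules C -> AB
          ("L", "R"): {"S"}, ("S", "R"): {"T"}}

def N(s, i, k):
    """Set of nonterminals that can produce s[i..k] (1-based, inclusive)."""
    if i == k:
        return UNARY.get(s[i - 1], set())
    result = set()
    for j in range(i, k):                  # split point: s[i..j] and s[j+1..k]
        left, right = N(s, i, j), N(s, j + 1, k)
        for A in left:
            for B in right:
                result |= BINARY.get((A, B), set())
    return result

# "S" in N("(())", 1, 4) holds, since (()) is well-balanced.
```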

In practice, we don’t want to directly implement this recursive definition, because we end up with a lot of redundant work. (The same function calls happen over and over again. This is the same problem as with the simple recursive definition of the Fibonacci function.) The easiest solution is to use Dynamic Programming.

In the Dynamic Programming implementation of CYK, we create a two-dimensional array N of size $n$ by $n$, with the idea that N[i,k] will hold the value of ${\cal N}(i,k)$. Then we can define

$$\begin{array}{lcl} N[i,i] & := & \{\ C \in {\cal V}\ |\ C \to a_i\ \}\\ N[i,k] & := & \{\ C \in {\cal V}\ |\ C \to AB \ \land \ A \in N[i,j]\ \land \ B \in N[j+1,k] \text{ for some } i \leq j < k\ \} \end{array}$$

If we fill the entries of the array in the right order (starting at the diagonal), then we can ensure that when computing the value of a new array cell, we only look at other entries in the array that have already been computed. The last cell filled is N[1,$n$]; if it contains our start symbol, then we have verified that the entire string is in the language of the grammar.
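Here is one possible dynamic-programming sketch of this table-filling order, again using the parentheses grammar; the encoding via `UNARY` and `BINARY` tables is an assumption, but the algorithm is CYK as described above:

```python
# CYK by dynamic programming, filling the table diagonal by diagonal.
# Grammar: S -> LT | SS | LR,  T -> SR,  L -> (,  R -> ).

UNARY = {"(": {"L"}, ")": {"R"}}
BINARY = {("L", "T"): {"S"}, ("S", "S"): {"S"},
          ("L", "R"): {"S"}, ("S", "R"): {"T"}}

def cyk(s, start="S"):
    n = len(s)
    if n == 0:
        return False
    # N[i][k] holds the set of nonterminals producing s[i..k] (1-based).
    N = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):              # main diagonal: length-1 substrings
        N[i][i] = set(UNARY.get(s[i - 1], set()))
    for length in range(2, n + 1):         # remaining diagonals, shortest first
        for i in range(1, n - length + 2):
            k = i + length - 1
            for j in range(i, k):          # all ways to split s[i..k] in two
                for A in N[i][j]:
                    for B in N[j + 1][k]:
                        N[i][k] |= BINARY.get((A, B), set())
    return start in N[1][n]

# cyk("(()())") is True; cyk("(()") is False.
```

Because the table is filled shortest-substrings-first, every cell N[i][j] and N[j+1][k] consulted on the right-hand side has already been computed.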

Example

Suppose we want to parse

$$\mathtt{(()())}$$

using the grammar

$$\begin{array}{lcl} S&\to&LT\ |\ SS\ |\ LR\\ T&\to&SR \\ L&\to&( \\ R&\to&) \end{array}$$

We can create an array N of size 6 by 6, and start by filling in the diagonal entries (i.e., the sets of nonterminals that can directly produce each character in our string):

[Figure: the 6-by-6 table with its main diagonal filled in]

Then we can work on the next diagonal. There is no rule $?\to LL$, so ${\cal N}(1,2)$ is empty. There is a rule $S\to LR$, so ${\cal N}(2,3)$ is nonempty (i.e., $S$ can produce the substring $\mathtt{()}$ consisting of the second and third characters of the input). Continuing on, we get:

[Figure: the table with the second diagonal filled in]

We can then fill in the next diagonal:

[Figure: the table with the third diagonal filled in]

Above, $T\in{\cal N}(4,6)$ because $S\in{\cal N}(4,5)$ and $R\in{\cal N}(6,6)$ and $T\to SR$; i.e., we have discovered that $T$ can produce the end of our input $~\mathtt{())}~$ because $S$ can produce the first two of these characters and $R$ can produce the last character.

Finally, we fill in the remaining diagonals to generate

[Figure: the completed table]

Since $S\in{\cal N}(1,6)$, we have verified that the start symbol can produce the whole string, and the string

$$\mathtt{(()())}$$

is in the language of the given grammar.

For a string of length $n$ we have to calculate values for $O(n^2)$ array elements, each containing a set of nonterminals. Each calculation considers $O(n)$ possible split points, so the CYK Algorithm runs in total time $O(n^3)$.