The CYK Algorithm

We conclude the discussion of context-free languages by answering the question, “How can we tell whether a string is in the language of a context-free grammar?”

Parsing is the process of taking a linear sequence of characters and converting this into structured data, such as a parse tree (or, more commonly in practice, into a simplified version of the parse tree called an Abstract Syntax Tree).

There are a number of clever parsing algorithms commonly used in practice, including Recursive Descent and Shift-Reduce Parsing. Both of these algorithms are very efficient; parsing a string of length $n$ can be done in $O(n)$ time.

Unfortunately, these algorithms only work for some CFGs, not all. Recursive Descent is designed to work if the CFG is “LL($k$)” (for some $k\geq 1$), while Shift-Reduce Parsing works if the CFG is “LR($k$)” or “LALR($k$)” (for some $k\geq 0$).

In many computer applications, this isn’t a problem, because we can often modify the grammar or even the entire language to make these algorithms work! The original designers of Java, for example, chose a syntax for the language—keywords, operators, punctuation, etc.—specifically designed to be LALR(1).

However, we don’t always have the option of changing the language. A linguist trying to model a portion of English using CFGs, for example, doesn’t have the option of changing the syntax of English to make it easier to parse! And since this is a course on Theory of Computability, we still might like to know whether it’s possible to parse strings using arbitrary context-free grammars.

The answer is yes!

The CYK Algorithm

There are several parsing algorithms that work for all CFGs, such as Earley’s Algorithm, though none are terribly efficient (often $O(n^3)$ for an input of length $n$). Here we will briefly describe the CYK Algorithm, probably the simplest algorithm for general-purpose CFG parsing.

(The CYK algorithm was independently invented by many researchers; the name comes from the initials of Cocke, Younger, and Kasami, three of the people who rediscovered and published this algorithm.)

Chomsky Normal Form

Definition

A grammar is said to be in Chomsky Normal Form if all rules are either of the form $A\to a$ (i.e., the nonterminal can be replaced by a single terminal) or of the form $A\to BC$ (i.e., the nonterminal can be replaced by two nonterminals).

Example

The grammar

$$S\to (S)\ |\ SS\ |\ ()$$

is not in Chomsky Normal Form, because the first and third rules don’t have the right shape.

However, the grammar

$$\begin{array}{lcl} S&\to&LT\ |\ SS\ |\ LR\\ T&\to&SR \\ L&\to&( \\ R&\to&) \end{array}$$

is in Chomsky Normal Form (and has the same language, namely non-empty well-balanced strings of parentheses).
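For example, the string $\mathtt{(())}$ can be derived in this second grammar as follows (each step applies one rule):

$$S \Rightarrow LT \Rightarrow (T \Rightarrow (SR \Rightarrow (LRR \Rightarrow ((RR \Rightarrow (()R \Rightarrow (())$$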

The CYK algorithm only works (efficiently) for grammars in Chomsky Normal Form. This is less of a problem than it might seem at first, because there’s an algorithm to convert any CFG into a Chomsky Normal Form grammar with the same language. In fact, if the grammar doesn’t explicitly mention $\epsilon$, then the conversion is relatively easy to do by hand:

Example

Given the grammar

$$S\to (S)\ |\ SS\ |\ ()$$
  1. First, we can replace the rule $S\to ()$ with the rules
     $$\begin{array}{lcl} S&\to&LR\\ L&\to&(\\ R&\to&) \end{array}$$
  2. Second, we can break up the rule $S\to \mathtt{(}S\mathtt{)}$ into two binary rules
     $$\begin{array}{lcl} S&\to&(T\\ T&\to&S) \end{array}$$
     and then replace the parentheses with references to $L$ and $R$ to get
     $$\begin{array}{lcl} S&\to&LT\\ T&\to&SR \end{array}$$
     Putting this all together, we get the equivalent grammar
     $$\begin{array}{lcl} S&\to&LT\ |\ SS\ |\ LR\\ T&\to&SR \\ L&\to&( \\ R&\to&) \end{array}$$
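The two hand conversion steps above can be sketched in code. This is a minimal sketch, assuming (as the text does) that the grammar has no $\epsilon$-rules; it also assumes there are no unit rules like $A\to B$. The names `to_cnf`, `N0`, `B0` are illustrative, not part of any standard library.

```python
# A sketch of the two conversion steps used above:
#   TERM: in rules with 2+ symbols, replace each terminal by a fresh nonterminal,
#   BIN:  split right-hand sides longer than two symbols into binary rules.
# Nonterminals are uppercase strings; terminals are other characters.

def to_cnf(rules):
    """rules: list of (lhs, rhs) pairs, with rhs a tuple of symbols."""
    term_names = {}   # terminal -> fresh nonterminal standing for it
    out = []

    # TERM step: introduce N_a -> a for each terminal a in a long rule.
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            new_rhs = []
            for sym in rhs:
                if sym.isupper():               # already a nonterminal
                    new_rhs.append(sym)
                else:                           # terminal: replace by fresh name
                    name = term_names.setdefault(sym, f"N{len(term_names)}")
                    new_rhs.append(name)
            rhs = tuple(new_rhs)
        out.append((lhs, rhs))
    for terminal, name in term_names.items():
        out.append((name, (terminal,)))

    # BIN step: split A -> X1 X2 ... Xn (n > 2) into A -> X1 A', A' -> X2 ... Xn.
    binary = []
    counter = [0]
    for lhs, rhs in out:
        while len(rhs) > 2:
            fresh = f"B{counter[0]}"
            counter[0] += 1
            binary.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        binary.append((lhs, rhs))
    return binary

# Converting S -> (S) | SS | () from the text
# (N0 and N1 play the roles of L and R, and B0 the role of T):
cnf = to_cnf([("S", ("(", "S", ")")), ("S", ("S", "S")), ("S", ("(", ")"))])
```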

The Idea

Suppose our input is a string

$$a_1a_2\cdots a_n$$

(where $n\geq 1$). The idea is to answer the question, “For every substring of the input $a_i\ldots a_k$, what nonterminals can produce that substring?”

We denote by ${\cal N}(i,k)$ the set of nonterminals that can produce the substring $a_i\ldots a_k$.

More formally, for every $i,k$ such that $1 \leq i \leq k \leq n$,

$${\cal N}(i, k) := \{\ B \in {\cal V}\ |\ B \Rightarrow \cdots \Rightarrow a_ia_{i+1}\cdots a_k\ \}$$

We are primarily interested in calculating the set ${\cal N}(1,n)$, because the start symbol $S$ is in this set iff $S\Rightarrow\cdots\Rightarrow a_1\cdots a_n$, that is, iff our given string is in the language of the grammar.

The Algorithm

We can calculate ${\cal N}(1,n)$ recursively as follows:

$$\begin{array}{lcl} {\cal N}(i, i) & := & \{\ C \in {\cal V}\ |\ C \to a_i\ \}\\ {\cal N}(i, k) & := & \{\ C \in {\cal V}\ |\ C \to AB \ \land \ A \in {\cal N}(i, j) \ \land \ B \in {\cal N}(j+1, k) \text{ for some } i \leq j < k\ \} \end{array}$$

That is,

  • Since our grammar is in Chomsky Normal Form, the only nonterminals that can produce a single-character substring $a_i$ are those with a rule that directly replaces that nonterminal with the character $a_i$.
  • For a nonterminal $C$ to produce a longer string $a_i\cdots a_k$, there has to be a rule $C\to AB$ such that $A$ produces the first part of this string and $B$ produces the rest.
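As a concrete sketch, this recursive definition can be transcribed almost directly into code. The grammar encoding below (two rule tables, `UNARY` and `BINARY`) is an assumption for illustration, using the parentheses grammar from earlier:

```python
# Naive recursive computation of N(i, k) for the grammar
#   S -> LT | SS | LR,  T -> SR,  L -> (,  R -> ).
# Indices are 1-based to match the text.

UNARY = {"(": {"L"}, ")": {"R"}}                      # rules C -> a
BINARY = {("L", "T"): {"S"}, ("S", "S"): {"S"},       # rules C -> AB
          ("L", "R"): {"S"}, ("S", "R"): {"T"}}

def N(s, i, k):
    """Set of nonterminals that can produce s[i..k] (1-based, inclusive)."""
    if i == k:
        return UNARY.get(s[i - 1], set())
    result = set()
    for j in range(i, k):                  # split point: s[i..j] and s[j+1..k]
        left, right = N(s, i, j), N(s, j + 1, k)
        for A in left:
            for B in right:
                result |= BINARY.get((A, B), set())
    return result

# "S" in N("(())", 1, 4) holds, since (()) is well-balanced.
```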

In practice, we don’t want to directly implement this recursive definition, because we end up with a lot of redundant work. (The same function calls happen over and over again. This is the same problem as with the simple recursive definition of the Fibonacci function.) The easiest solution is to use Dynamic Programming.

In the Dynamic Programming implementation of CYK, we create a two-dimensional array N of size $n$ by $n$, with the idea that N[i,k] will hold the value of ${\cal N}(i,k)$. Then we can define

$$\begin{array}{lcl} N[i,i] & := & \{\ C \in {\cal V}\ |\ C \to a_i\ \}\\ N[i,k] & := & \{\ C \in {\cal V}\ |\ C \to AB \ \land \ A \in N[i,j]\ \land \ B \in N[j+1,k] \text{ for some } i \leq j < k\ \} \end{array}$$

If we fill the entries of the array in the right order (starting at the diagonal), then we can ensure that when computing the value of a new array cell, we only look at other entries in the array that have already been computed. The last cell filled is N[1,$n$]; if it contains our start symbol, then we have verified that the entire string is in the language of the grammar.
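Here is one possible dynamic-programming sketch of this table-filling order, again using the parentheses grammar; the encoding via `UNARY` and `BINARY` tables is an assumption, but the algorithm is CYK as described above:

```python
# CYK by dynamic programming, filling the table diagonal by diagonal.
# Grammar: S -> LT | SS | LR,  T -> SR,  L -> (,  R -> ).

UNARY = {"(": {"L"}, ")": {"R"}}
BINARY = {("L", "T"): {"S"}, ("S", "S"): {"S"},
          ("L", "R"): {"S"}, ("S", "R"): {"T"}}

def cyk(s, start="S"):
    n = len(s)
    if n == 0:
        return False
    # N[i][k] holds the set of nonterminals producing s[i..k] (1-based).
    N = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):              # main diagonal: length-1 substrings
        N[i][i] = set(UNARY.get(s[i - 1], set()))
    for length in range(2, n + 1):         # remaining diagonals, shortest first
        for i in range(1, n - length + 2):
            k = i + length - 1
            for j in range(i, k):          # all ways to split s[i..k] in two
                for A in N[i][j]:
                    for B in N[j + 1][k]:
                        N[i][k] |= BINARY.get((A, B), set())
    return start in N[1][n]

# cyk("(()())") is True; cyk("(()") is False.
```

Because the table is filled shortest-substrings-first, every cell N[i][j] and N[j+1][k] consulted on the right-hand side has already been computed.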

Example

Suppose we want to parse

$$\mathtt{(()())}$$

using the grammar

$$\begin{array}{lcl} S&\to&LT\ |\ SS\ |\ LR\\ T&\to&SR \\ L&\to&( \\ R&\to&) \end{array}$$

We can create an array N of size 6 by 6, and start by filling in the diagonal entries (i.e., the sets of nonterminals that can directly produce each character in our string):

[Figure: the 6-by-6 table with its main diagonal filled in]

Then we can work on the next diagonal. There is no rule $?\to LL$, so ${\cal N}(1,2)$ is empty. There is a rule $S\to LR$, so ${\cal N}(2,3)$ is nonempty (i.e., $S$ can produce the substring $\mathtt{()}$ consisting of the second and third characters of the input). Continuing on, we get:

[Figure: the table with the second diagonal filled in]

We can then fill in the next diagonal:

[Figure: the table with the third diagonal filled in]

Above, $T\in{\cal N}(4,6)$ because $S\in{\cal N}(4,5)$ and $R\in{\cal N}(6,6)$ and $T\to SR$; i.e., we have discovered that $T$ can produce the end of our input $~\mathtt{())}~$ because $S$ can produce the first two of these characters and $R$ can produce the last character.

Finally, we fill in the remaining diagonals to generate

[Figure: the completed table]

Since $S\in{\cal N}(1,6)$, we have verified that the start symbol can produce the whole string, and the string

$$\mathtt{(()())}$$

is in the language of the given grammar.

For a string of length $n$ we have to calculate values for $O(n^2)$ array elements, each containing a set of nonterminals. Each calculation considers $O(n)$ possible split points, so the CYK Algorithm runs in total time $O(n^3)$.