CS 151, Spring 2000

CS 151 (Artificial Intelligence)
Chart Parser

This page describes the chart parsing algorithm. There are several versions of chart parsers: this one is predictive (driven by incomplete rules) rather than bottom-up (driven by what's next on the input stream).

Data structures

A chart parser has three data structures:

an input stack, which holds the words of the input sentence (in order)
a chart, which holds completed constituents organized by starting position and length
a set of edges, organized by ending position.

A sample grammar

Here's a very simple grammar for working examples:

   S -> NP VP
   VP -> V NP
   VP -> V NP NP
   NP -> det noun
   NP -> det adj noun

The input

Positions in the input sentence will be numbered starting with zero and will be the positions between successive words. For example:

    The   vine   climbed   the   trellis
  0     1      2         3     4         5

I will assume that the input words have already been annotated with their parts of speech by the time the chart parser first reads the input (the tag verb corresponds to the category V in the grammar). Therefore, the input really looks like:

    The   vine   climbed   the   trellis
    det   noun   verb      det   noun

The input is consumed left-to-right as parsing progresses.
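
A small sketch of the position numbering, assuming the tagged input is stored as a list of (word, category) pairs. The names TAGGED and span are hypothetical, and the verb is tagged V so that it matches the grammar.

```python
# Hypothetical representation of the tagged input sentence.
TAGGED = [("The", "det"), ("vine", "noun"), ("climbed", "V"),
          ("the", "det"), ("trellis", "noun")]


def span(tagged, start, length):
    """Return the words lying between positions start and start + length."""
    return [word for word, _ in tagged[start:start + length]]
```

Because positions sit between words, a constituent that starts at position 3 and has length 2 covers the words between positions 3 and 5, i.e. "the trellis".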

The chart

The chart corresponding to this sentence might look as follows, after the parser is finished:

length

  5      S

  4 

  3                 VP

  2     NP                NP
  
  1     det    noun  verb det  noun
 
start    0     1     2    3    4

A cell in the chart can contain more than one constituent. Each constituent is usually stored together with information about which grammar rule was used to generate it and which smaller constituents make it up. More than one such explanation can be stored for a single constituent (e.g. an NP in some chart cell (x,y)) if that constituent has more than one parse.

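The chart prevents redundant work when a single constituent has two possible internal structures: the constituent is entered in its cell only once, so anything built on top of it is computed once rather than once per internal structure. For example, the NP "red socks and shoes" might be parsed as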

      [red [socks and shoes]]
or    [[red socks] and shoes]
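
One possible layout for the chart, sketched under the assumption that each cell is keyed by (start, length) and maps a category to its list of explanations. The function name add_constituent and the use of strings as explanations are made up for illustration.

```python
from collections import defaultdict

# Hypothetical chart layout: chart[(start, length)][category] is a list of
# explanations, one per way of building that constituent.
chart = defaultdict(dict)


def add_constituent(chart, start, length, category, explanation):
    """Record a constituent; return True only if it is new to its cell."""
    cell = chart[(start, length)]
    is_new = category not in cell
    cell.setdefault(category, []).append(explanation)
    return is_new


# The ambiguous NP is entered once and then gains a second explanation.
first = add_constituent(chart, 0, 4, "NP", "[red [socks and shoes]]")
second = add_constituent(chart, 0, 4, "NP", "[[red socks] and shoes]")
```

Here first is True and second is False: the second parse only adds an explanation, so nothing that depends on this NP needs to be recomputed.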

The edges

The final data structure is a set of edges. Each edge consists of a grammatical rule, plus information about how it matches up against the input. Specifically, the edge contains:

a rule (e.g. VP -> V NP)
the position up to which we have matched the rule to the input, usually indicated by a dot in the middle of the rule (e.g. VP -> V * NP)
the starting position, i.e. first input word matched
the number of input words matched (so far)

Edges are organized by their ending position (last input word matched against their rule). For example, in the trellis example, the edges might be:

end

  0     S -> * NP VP    
        NP -> * det adj noun
        NP -> * det noun

  1     NP -> det * noun
        NP -> det * adj noun

  2     NP -> det noun *
        VP -> * V NP NP
        S -> NP * VP
        VP -> * V NP

  3     VP -> V * NP
        VP -> V * NP NP

  etc ....

During the parsing process, edges are added but never deleted.
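
An edge could be represented along the following lines. This is a minimal sketch; the class name Edge and its field and method names are assumptions. Making the class frozen keeps edges hashable, so they can be stored in sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    lhs: str      # lefthand side of the rule, e.g. "VP"
    rhs: tuple    # righthand side of the rule, e.g. ("V", "NP")
    dot: int      # number of righthand-side items matched so far
    start: int    # position where the match begins
    length: int   # number of input words matched so far

    @property
    def end(self):
        """Ending position, used to decide which edge set the edge belongs to."""
        return self.start + self.length

    def is_finished(self):
        return self.dot == len(self.rhs)

    def next_item(self):
        """The item right after the dot, or None for a finished edge."""
        return None if self.is_finished() else self.rhs[self.dot]

    def __str__(self):
        items = list(self.rhs)
        items.insert(self.dot, "*")
        return f"{self.lhs} -> {' '.join(items)}"
```

For instance, Edge("VP", ("V", "NP"), 1, 2, 1) prints as VP -> V * NP and has ending position 3, matching the entry listed under position 3 above.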

The overall algorithm

Suppose that there are k words in the input. Set up a chart with height and width k. Add the input words to the appropriate cells of the chart.

For each ending position i in the input (i.e. i runs from 0 up through k), set up two sets Si and Di. Set S0 to contain one edge for each rule expanding the start node of the grammar (i.e. the sentence node S for most English grammars), with the dot at the far left, starting position 0, and length 0. Initialize all the other sets to the null set.

Walk through the ending positions i, from 0 through k. For each ending position i, Si will be treated as a search queue (BFS or DFS). Edges will be extracted one by one from Si and put into Di. When Si becomes empty, we remove the first word from the input stack and proceed to the next ending position i+1.

Specifically, for each i:

   loop until Si is empty
       * remove first edge e from Si
       * add e to Di
       * apply three operations to e:  scan, complete, and predict
         these operations may produce new edges
       * the new edges are added to Si or Si+1, if they are not already
         in Si, Di, or Si+1
   pop first word off the input stack

When all ending positions have been processed, the chart contains all complete constituents found by the parser. The input has been successfully parsed if the cell for start 0 and length k (the top left cell in the diagram above) contains an S node. If so, we can extract one or more parse trees from the chart. (Multiple parse trees will exist if the sentence has ambiguous syntax.)
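
The whole loop can be sketched as a small recognizer. This is an illustration under several assumptions, not a definitive implementation: edges are plain tuples (lhs, rhs, dot, start, length), Si is treated as a BFS queue, the input is indexed by position instead of being popped from a stack, the chart stores only category names (no explanations), the grammar has no empty rules, and verbs are tagged V. The names parse and GRAMMAR are made up for the example.

```python
from collections import deque

GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "NP")],
    "NP": [("det", "noun"), ("det", "adj", "noun")],
}


def parse(tags, grammar=GRAMMAR, start_symbol="S"):
    """Return the chart as a dict mapping (start, length) to a set of categories."""
    k = len(tags)
    chart = {(i, 1): {tag} for i, tag in enumerate(tags)}
    S = [deque() for _ in range(k + 1)]   # edges waiting to be processed
    D = [set() for _ in range(k + 1)]     # edges already processed
    for rhs in grammar[start_symbol]:
        S[0].append((start_symbol, rhs, 0, 0, 0))

    def add(edge):
        """Queue an edge at its ending position unless it is already known."""
        end = edge[3] + edge[4]
        if edge not in S[end] and edge not in D[end]:
            S[end].append(edge)

    for i in range(k + 1):
        while S[i]:
            edge = S[i].popleft()
            D[i].add(edge)
            lhs, rhs, dot, start, length = edge
            if dot < len(rhs):
                needed = rhs[dot]
                # predict: start every rule that could build the needed item
                for new_rhs in grammar.get(needed, []):
                    add((needed, new_rhs, 0, i, 0))
                # scan: advance over the next word if its category matches
                if i < k and tags[i] == needed:
                    add((lhs, rhs, dot + 1, start, length + 1))
            else:
                # complete: record the constituent, then advance waiting edges
                cell = chart.setdefault((start, length), set())
                if lhs not in cell:
                    cell.add(lhs)
                    for l2, r2, d2, s2, n2 in D[start]:
                        if d2 < len(r2) and r2[d2] == lhs:
                            add((l2, r2, d2 + 1, s2, n2 + length))
    return chart
```

Calling parse(["det", "noun", "V", "det", "noun"]) yields a chart whose cell (0, 5) contains S, so the trellis sentence is accepted.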

Predict

Suppose that the current edge e is not finished. The predict operation extracts the next item X needed by e (the constituent right after the dot in the edge). Predict finds all the rules in the grammar whose lefthand side is X. For each, it makes a new edge with the dot at the far left, starting position i, and length 0, and adds it to Si.

For example, suppose that e is the edge S -> NP * VP, from position 0 and of length 2. Then we will add the following new edges

     VP -> * V NP from position 2, length 0
     VP -> * V NP NP from position 2, length 0
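
A minimal sketch of predict, assuming edges are tuples (lhs, rhs, dot, start, length) and the grammar is a dictionary from lefthand sides to righthand sides. Both layouts, and the name predict itself, are assumptions for illustration.

```python
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "NP")],
    "NP": [("det", "noun"), ("det", "adj", "noun")],
}


def predict(edge, grammar):
    """Return the new edges predicted from an unfinished edge."""
    lhs, rhs, dot, start, length = edge
    if dot == len(rhs):
        return []                # finished edges predict nothing
    needed = rhs[dot]
    end = start + length         # new edges begin where this edge ends
    return [(needed, new_rhs, 0, end, 0) for new_rhs in grammar.get(needed, [])]
```

Applied to the edge ("S", ("NP", "VP"), 1, 0, 2), this returns the two VP edges from the example, both starting at position 2 with length 0.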

Scan

Suppose that the current edge e is not finished and that category X follows the dot in its rule (i.e. X is the next type of item that e needs). The scan operation examines the next word of the input. If it has grammatical category X, then it creates a new edge e' which is just like e except that the dot is moved one item right and the length is incremented by one. e' is added to Si+1.

For example, suppose that e is the edge NP -> * det noun, from position 0 of length 0. Suppose the next input item is "the." Then we add to S1 the new edge

   NP -> det * noun, from position 0, length 1
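
Scan might be sketched as follows, assuming edges are tuples (lhs, rhs, dot, start, length) and the input is a list of part-of-speech tags indexed by position. The function name and both layouts are hypothetical.

```python
def scan(edge, tags):
    """Return a list holding the advanced edge if the next word matches, else []."""
    lhs, rhs, dot, start, length = edge
    end = start + length
    if dot == len(rhs) or end >= len(tags):
        return []                # finished edge, or no input left
    if tags[end] != rhs[dot]:
        return []                # next word has the wrong category
    return [(lhs, rhs, dot + 1, start, length + 1)]
```

With tags beginning det, noun, the edge ("NP", ("det", "noun"), 0, 0, 0) advances to ("NP", ("det", "noun"), 1, 0, 1), which belongs in S1.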

Complete

Suppose that the current edge e is finished. That is, its dot is at the far right of its rule. Suppose that e looks like

    X -> Y1 Y2 .. Ym *
       from position j, length n

Here j and n count input positions and words, so n need not equal m, the number of items on the righthand side of the rule.

First check if X is already in chart cell (j,n). If so, simply add the edge e to the set of explanations for this constituent X. This is all we need to do.

If X is not already in chart cell (j,n), add it. Then examine each edge E in Dj, i.e. each processed edge that ends where e starts. If E is incomplete and the next item it needs is an X, create a new edge E' with the dot moved one item right. The length of E' is the sum of the lengths of E and e. Add E' to Si.

For example, suppose that e is NP -> det noun *, from position 3, length 2. When we examine the set D3, suppose that we find an edge VP -> V * NP, from position 2, length 1. We then create the new edge VP -> V NP *, from position 2, length 3, and add this edge to S5.
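
A sketch of complete under stated assumptions: edges are tuples (lhs, rhs, dot, start, length), the chart maps (start, length) to a dictionary from category to explanations, and the processed sets are passed in as a dictionary from ending position to a set of edges. The name complete and the parameter names are made up for illustration.

```python
def complete(edge, chart, done):
    """Apply the complete operation to a finished edge.

    Returns the list of new edges for the current search queue.
    """
    lhs, rhs, dot, start, length = edge
    cell = chart.setdefault((start, length), {})
    if lhs in cell:
        cell[lhs].append(edge)   # known constituent: one more explanation
        return []
    cell[lhs] = [edge]
    new_edges = []
    # Advance every processed edge that ends where this one starts
    # and is waiting for this category.
    for l2, r2, d2, s2, n2 in done.get(start, ()):
        if d2 < len(r2) and r2[d2] == lhs:
            new_edges.append((l2, r2, d2 + 1, s2, n2 + length))
    return new_edges
```

With done[3] holding the edge VP -> V * NP (from position 2, length 1), completing NP -> det noun * (from position 3, length 2) yields VP -> V NP * from position 2, length 3, as in the example above.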

This page is maintained by Margaret Fleck.
