This page describes the chart parsing algorithm. There are several versions of chart parsers: this one is predictive (driven by incomplete rules) rather than bottom-up (driven by what's next on the input stream).
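A chart parser has three data structures: an input stack holding the words that have not yet been consumed, a chart recording the constituents found so far, and a set of edges.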
Here's a very simple grammar for working examples:
S -> NP VP
VP -> V NP
VP -> V NP NP
NP -> det noun
NP -> det adj noun
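As a rough illustration, the grammar could be held in a program as a table from each left-hand side to its alternative right-hand sides. This is only a sketch in Python; the names GRAMMAR, rules_for, and PARTS_OF_SPEECH are my own choices for illustration and are not part of the algorithm as described on this page.

```python
# Illustrative representation of the example grammar (all names are assumptions).
# Each left-hand side maps to a list of right-hand sides; each right-hand
# side is a tuple of category names.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "NP")],
    "NP": [("det", "noun"), ("det", "adj", "noun")],
}


def rules_for(category):
    """Return every right-hand side that expands `category` (empty if none)."""
    return GRAMMAR.get(category, [])


# Categories that never appear on a left-hand side are parts of speech.
PARTS_OF_SPEECH = {
    symbol
    for alternatives in GRAMMAR.values()
    for rhs in alternatives
    for symbol in rhs
    if symbol not in GRAMMAR
}

print(rules_for("VP"))
print(sorted(PARTS_OF_SPEECH))
```

Looking up rules by left-hand side is the only grammar access the parser needs, so a dictionary keyed on the left-hand side seems a natural fit.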
Positions in the input sentence will be numbered starting with zero and will be the positions between successive words. For example:
  The   vine   climbed   the   trellis
0     1      2         3     4         5
I will assume that the input words have been annotated with their part of speech, when the chart parser first reads the input. Therefore, the input really looks like:
The   vine   climbed   the   trellis
det   noun   verb      det   noun
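To make the position numbering concrete, here is a small Python sketch of the tagged input. The helper names span and tag_after are hypothetical, and the tags are simply written in by hand, as assumed above.

```python
# The example sentence as (word, part-of-speech) pairs. The tags are given
# by hand here; how they are obtained is outside the scope of this sketch.
TAGGED = [
    ("The", "det"),
    ("vine", "noun"),
    ("climbed", "verb"),
    ("the", "det"),
    ("trellis", "noun"),
]


def span(start, length):
    """Words covered by a constituent that begins at position `start`."""
    return [word for word, _tag in TAGGED[start:start + length]]


def tag_after(position):
    """Tag of the word lying between `position` and `position + 1`, or None."""
    return TAGGED[position][1] if position < len(TAGGED) else None


print(span(3, 2))    # the words between positions 3 and 5
print(tag_after(2))  # the tag of the word between positions 2 and 3
```

Because positions sit between words, a constituent that starts at position p and has length n covers exactly the slice from p to p+n.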
The input is consumed left-to-right as parsing progresses.
The chart corresponding to this sentence might look as follows, after the parser is finished:
length
  5     S
  4
  3                   VP
  2     NP                   NP
  1     det    noun   verb   det    noun
start   0      1      2      3      4
A cell in the chart can contain more than one constituent. Each constituent is usually stored together with information about which grammar rule was used to generate it and what smaller constituents make it up. More than one such explanation can be stored for a single constituent (e.g. the NP in chart cell (x,y)) if that constituent had more than one parse.
The chart is used to prevent redundant work if there are two possible internal structures for a single constituent. For example, the NP "red socks and shoes" might be parsed as
[red [socks and shoes]]
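or [[red socks] and shoes]
Either way, the chart holds just one NP spanning these four words, so any larger constituent that contains it is built only once.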
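One possible way to store the chart is sketched below in Python: each cell, indexed by (start, length), maps a category to its list of explanations. The function add_constituent and the two NP rules used for the "red socks and shoes" example are hypothetical; those rules are not in the toy grammar above.

```python
from collections import defaultdict

# chart[(start, length)][category] is a list of explanations; here each
# explanation is just the right-hand side of the rule that was used.
chart = defaultdict(dict)


def add_constituent(start, length, category, explanation):
    """Record a constituent; return True only if the category is new to the cell."""
    cell = chart[(start, length)]
    is_new = category not in cell
    cell.setdefault(category, []).append(explanation)
    return is_new


# Hypothetical rules for "red socks and shoes" (assumed for illustration):
# NP -> adj NP gives [red [socks and shoes]],
# NP -> NP conj NP gives [[red socks] and shoes].
first = add_constituent(0, 4, "NP", ("adj", "NP"))
second = add_constituent(0, 4, "NP", ("NP", "conj", "NP"))

print(first, second)
print(len(chart[(0, 4)]["NP"]))
```

The second call reports that the NP was already present, so a parser using this chart would know that nothing further needs to be built on top of it; it only gains a second explanation.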
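The final data structure is a set of edges. Each edge consists of a grammatical rule, plus information about how it matches up against the input. Specifically, the edge contains: the rule, with a dot (written * below) marking how much of its right-hand side has been matched so far; the input position at which the match starts; and the length of the match, in words.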
Edges are organized by their ending position (last input word matched against their rule). For example, in the trellis example, the edges might be:
end
 0    S  -> * NP VP
      NP -> * det adj noun
      NP -> * det noun
 1    NP -> det * noun
      NP -> det * adj noun
 2    NP -> det noun *
      VP -> * V NP NP
      S  -> NP * VP
      VP -> * V NP
 3    VP -> V * NP
      VP -> V * NP NP
etc ....
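An edge might be represented roughly as follows. This is a sketch in Python: the class name Edge and its field names are assumptions, chosen to match the way edges are described on this page (a rule, a dot, a start position, and a length).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    lhs: str      # left-hand side of the rule
    rhs: tuple    # right-hand side of the rule
    dot: int      # number of right-hand-side items matched so far
    start: int    # input position where the match begins
    length: int   # number of input words matched so far

    @property
    def end(self):
        """Ending position: edges are grouped by this value."""
        return self.start + self.length

    @property
    def finished(self):
        return self.dot == len(self.rhs)

    @property
    def needs(self):
        """Category right after the dot, or None if the edge is finished."""
        return None if self.finished else self.rhs[self.dot]

    def __str__(self):
        items = list(self.rhs)
        items.insert(self.dot, "*")
        return f"{self.lhs} -> {' '.join(items)}"


edge = Edge("NP", ("det", "noun"), dot=1, start=0, length=1)
print(edge, "| ends at", edge.end, "| needs", edge.needs)
```

Making the class frozen keeps edges immutable and hashable, which suits a parser that adds edges but never deletes or changes them.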
During the parsing process, edges are added but never deleted.
Suppose that there are k words in the input. Set up a chart with height and width k. Add the input words to the appropriate cells of the chart.
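For each ending position i in the input (i.e. i runs from 0 up through k), set up two sets Si and Di: Si holds edges that still need to be processed, and Di holds edges that have already been processed. Set S0 to contain an edge for each rule expanding the start node of the grammar (i.e. the sentence node S for most English grammars), with the dot at the far left. Initialize all the other sets to the empty set.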
Walk through the ending positions i, from 0 through k. For each ending position i, Si will be treated as a search queue (BFS or DFS). Edges will be extracted one by one from Si and put into Di. When Si becomes empty, we remove the first word from the input stack and proceed to the next ending position i+1.
Specifically, for each i:
loop until Si is empty
    * remove first edge e from Si
    * add e to Di
    * apply three operations to e: scan, complete, and predict
      (these operations may produce new edges)
    * add each new edge to Si or Si+1, if it is not already
      in Si, Di, or Si+1
pop first word off the input stack
When all ending positions have been processed, the chart contains all complete constituents found by the parser. The input has been successfully parsed if the top left cell of the chart (start 0, length k) contains an S node. If so, we can extract one or more parse trees from the chart. (Multiple parse trees will exist if the sentence has ambiguous syntax.)
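The whole loop can be sketched compactly in Python. This is one possible reading of the procedure rather than a reference implementation: the names Edge, GRAMMAR, and parse are assumptions, the queue is processed breadth-first, explanations record only the rule's right-hand side, and "climbed" is tagged V so that it matches the grammar.

```python
from collections import defaultdict, deque, namedtuple

# Field names are illustrative: a rule (lhs, rhs), a dot index, a start
# position, and a length in words.
Edge = namedtuple("Edge", "lhs rhs dot start length")

GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "NP")],
    "NP": [("det", "noun"), ("det", "adj", "noun")],
}


def parse(tags, grammar=GRAMMAR, start_symbol="S"):
    """Run the loop over ending positions; return the finished chart."""
    k = len(tags)
    S = [deque() for _ in range(k + 1)]   # edges waiting to be processed
    D = [[] for _ in range(k + 1)]        # edges already processed
    chart = defaultdict(lambda: defaultdict(list))
    for position, tag in enumerate(tags):
        chart[(position, 1)][tag].append("input word")

    def add(edge):
        # File the edge under its ending position unless it is already known.
        end = edge.start + edge.length
        if edge not in S[end] and edge not in D[end]:
            S[end].append(edge)

    for rhs in grammar[start_symbol]:
        add(Edge(start_symbol, rhs, 0, 0, 0))

    for i in range(k + 1):
        while S[i]:
            e = S[i].popleft()            # popleft gives breadth-first order
            D[i].append(e)
            if e.dot < len(e.rhs):
                needed = e.rhs[e.dot]
                for rhs in grammar.get(needed, []):          # predict
                    add(Edge(needed, rhs, 0, i, 0))
                if i < k and tags[i] == needed:              # scan
                    add(e._replace(dot=e.dot + 1, length=e.length + 1))
            else:                                            # complete
                cell = chart[(e.start, e.length)]
                is_new = e.lhs not in cell
                cell[e.lhs].append(e.rhs)
                if is_new:
                    for E in list(D[e.start]):
                        if E.dot < len(E.rhs) and E.rhs[E.dot] == e.lhs:
                            add(E._replace(dot=E.dot + 1,
                                           length=E.length + e.length))
    return chart


tags = ["det", "noun", "V", "det", "noun"]   # "climbed" tagged V (assumption)
chart = parse(tags)
print("S" in chart[(0, len(tags))])
for (start, length), cell in sorted(chart.items()):
    print(start, length, sorted(cell))
```

Here the input is indexed by position instead of being popped from a stack, which has the same effect. For the trellis sentence this sketch should place an S in cell (0, 5), a VP in cell (2, 3), and NPs in cells (0, 2) and (3, 2), matching the chart shown earlier.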
Suppose that the current edge e is not finished. The predict operation extracts the next item X needed by e (the constituent right after the dot in the edge). Predict finds all the rules in the grammar whose left-hand side is X. For each, it makes a new edge with the dot at the far left, starting at the current position i with length 0, and adds it to Si.
For example, suppose that e is the edge S -> NP * VP, from position 0 and of length 2. Then we will add the following new edges:
VP -> * V NP from position 2, length 0
VP -> * V NP NP from position 2, length 0
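The predict step on its own might look like this in Python. The names Edge, GRAMMAR, and predict are illustrative assumptions, and the caller is assumed to add the returned edges to Si after checking for duplicates.

```python
from collections import namedtuple

# Illustrative edge record: rule (lhs, rhs), dot index, start, length.
Edge = namedtuple("Edge", "lhs rhs dot start length")

GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "NP")],
    "NP": [("det", "noun"), ("det", "adj", "noun")],
}


def predict(e, grammar=GRAMMAR):
    """Return new zero-length edges for the category that `e` needs next."""
    if e.dot == len(e.rhs):
        return []                      # a finished edge predicts nothing
    needed = e.rhs[e.dot]              # the item right after the dot
    here = e.start + e.length          # the ending position of e
    return [Edge(needed, rhs, 0, here, 0) for rhs in grammar.get(needed, [])]


e = Edge("S", ("NP", "VP"), dot=1, start=0, length=2)
for new_edge in predict(e):
    print(new_edge)
```

When the needed item is a part of speech such as det, the grammar has no rules for it and predict returns nothing; that case is handled by scan instead.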
Suppose that the current edge e is not finished and that category X follows the dot in its rule (i.e. X is the next type of item that e needs). The scan operation examines the next word of the input. If it has grammatical category X, then it creates a new edge e' which is just like e except that the dot is moved one item right and the length is incremented by one. e' is added to Si+1.
For example, suppose that e is the edge NP -> * det noun, from position 0 of length 0. Suppose the next input item is "the", which is tagged det. Then we add to S1 the new edge
NP -> det * noun, from position 0, length 1
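A sketch of scan in Python follows, under the same assumptions as before: Edge and scan are illustrative names, and the input is given as a list of part-of-speech tags indexed by position.

```python
from collections import namedtuple

# Illustrative edge record: rule (lhs, rhs), dot index, start, length.
Edge = namedtuple("Edge", "lhs rhs dot start length")


def scan(e, tags):
    """Return the advanced edge if the next word has the needed category, else None."""
    if e.dot == len(e.rhs):
        return None                    # a finished edge needs nothing
    i = e.start + e.length             # the ending position of e
    if i < len(tags) and tags[i] == e.rhs[e.dot]:
        # Same rule and start; dot moved one item right, one more word covered.
        return e._replace(dot=e.dot + 1, length=e.length + 1)
    return None


tags = ["det", "noun", "V", "det", "noun"]
e = Edge("NP", ("det", "noun"), dot=0, start=0, length=0)
print(scan(e, tags))
```

The returned edge ends one position further right than e, which is why it belongs in Si+1 and not in Si.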
Suppose that the current edge e is finished. That is, its dot is at the far right of its rule. Suppose that e looks like
X -> Y1 Y2 .. Ym *
from position j, length n
(Here j and n are the edge's start position and its length in words; n need not equal m, the number of items on the right-hand side.)
First check if X is already in chart cell (j,n). If so, simply add the rule e to the set of explanations for this constituent X. This is all we need to do.
If X is not already in chart cell (j,n), add it. Then examine each edge E in Dj, the processed edges that end where e starts. If E is incomplete and the next item it needs is an X, create a new edge E' with the dot moved one item right. E' keeps the start position of E, and its length is the sum of the lengths of E and e. Add E' to Si.
For example, suppose that e is NP -> det noun *, from position 3, length 2. When we examine the set D3, suppose that we find an edge VP -> V * NP, from position 2, length 1. We then create the new edge VP -> V NP *, from position 2, length 3, and add this edge to S5.
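The complete step could be sketched as follows in Python. The names Edge, complete, chart, and done are assumptions for illustration: chart maps (start, length) to a dictionary of explanations per category, and done maps an ending position to the edges already processed there.

```python
from collections import namedtuple

# Illustrative edge record: rule (lhs, rhs), dot index, start, length.
Edge = namedtuple("Edge", "lhs rhs dot start length")


def complete(e, chart, done):
    """Record finished edge `e` in the chart; return the edges it advances."""
    cell = chart.setdefault((e.start, e.length), {})
    is_new = e.lhs not in cell
    cell.setdefault(e.lhs, []).append(e.rhs)   # one more explanation
    if not is_new:
        return []                              # constituent already known
    return [
        E._replace(dot=E.dot + 1, length=E.length + e.length)
        for E in done.get(e.start, [])         # edges ending where e starts
        if E.dot < len(E.rhs) and E.rhs[E.dot] == e.lhs
    ]


chart = {}
done = {3: [Edge("VP", ("V", "NP"), dot=1, start=2, length=1)]}
e = Edge("NP", ("det", "noun"), dot=2, start=3, length=2)

first = complete(e, chart, done)    # the NP is new, so the VP edge advances
second = complete(e, chart, done)   # the NP is already there: nothing new
print(first)
print(second)
```

The first call reproduces the example above, yielding VP -> V NP * from position 2 with length 3. The second call only adds another explanation to the existing NP, which is how the chart avoids redoing work.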
This page is maintained by Margaret Fleck.