Adaptive parsing

Parsing is the transformation of flat text into data structures. Usually, this requires a syntax definition as input in addition to the text to be parsed. An adaptive parser (AP) performs the transformation with minimal additional input; in particular, it requires only syntactic information that a typical user can provide without the help of a programmer.

Current work

I am currently writing a prototype adaptive parser for extremely simple grammars that take the form of a list of similarly-structured records. The records in the targeted grammars are separated by textual delimiters of some length, as are the fields within the records.
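To make the targeted class of grammars concrete, here is a minimal sketch of the output such a parser aims for. It is not code from the prototype: the function name `parse_records` and the sample data are made up for illustration, and the delimiters are passed in by hand, whereas the point of an adaptive parser is to infer them from the text.

```python
def parse_records(text, record_delim, field_delim):
    """Split text into records, then split each record into fields.

    Both delimiters are given explicitly here; an adaptive parser
    would have to discover them on its own.
    """
    records = []
    for chunk in text.split(record_delim):
        if chunk == "":
            continue  # skip empty chunks, e.g. after a trailing delimiter
        records.append(chunk.split(field_delim))
    return records


# Hypothetical sample: newline-separated records, pipe-separated fields.
sample = "alice|42|red\nbob|17|blue\ncarol|30|green\n"
rows = parse_records(sample, record_delim="\n", field_delim="|")
for row in rows:
    print(row)
```

Delimiters may be longer than one character, so the same function handles text such as `a::b;;c::d` with `;;` between records and `::` between fields.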

Relevant to this work:

The code I've written is in a git repository. To see it in its current, disheveled state:
```
git clone http://www.cs.hmc.edu/~asampson/ap/ap.git
```
The code is written in Python (it was developed with Python 2.5, but compatibility with earlier versions is plausible). It depends on the NumPy package (available from the SciPy project). The 2D competitive learning visualization requires Matplotlib.
A brief overview of the technique used in the prototype (ap.py in the above repository); slides from a presentation on the problem and the same technique.

The next steps in this project will be:

Use the new clustering algorithms to more robustly handle variations in delimiters. This will require relaxation of assumptions in the cluster-filtering code.
Address performance. One possibility is to rewrite slow parts of the code in C. Another is to store the results of training for a given document, so that an expensive training session runs only once and cheaper evaluations can reuse those results to extract data whenever the document is updated.
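The idea of storing training results could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper `load_or_train`, the stand-in `toy_train`, and the shape of the stored model (a dictionary of delimiters) are hypothetical and do not describe the prototype. It is written for Python 3 and uses only the standard library.

```python
import os
import pickle
import tempfile


def load_or_train(document_text, cache_path, train):
    """Return a trained model, running train() only if no cached copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # cheap path: reuse earlier training
    model = train(document_text)   # expensive path: train once
    with open(cache_path, "wb") as f:
        pickle.dump(model, f)
    return model


calls = []  # records each time the (pretend) expensive step runs


def toy_train(document_text):
    # Stand-in for real training; a real trainer would infer these values.
    calls.append(len(document_text))
    return {"record_delim": "\n", "field_delim": "|"}


cache_path = os.path.join(tempfile.mkdtemp(), "model.pickle")
first = load_or_train("a|b\nc|d\n", cache_path, toy_train)         # trains, then caches
second = load_or_train("a|b\nc|d\ne|f\n", cache_path, toy_train)   # updated document, cached model
print(len(calls))
```

The cache is keyed by file path rather than by document content, so an updated version of the same document reuses the stored model. Deciding when a document has changed enough to require retraining is left open here.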

Examples

Status of the parser on various test cases and examples of the process are coming soon.

Applications

Some possible applications of an adaptive parser — even an extremely simple, list-of-records one like the current implementation — include:

Automatic screen-scraping. Building scrapers (specialized parsers) for each target document can be tedious. In many cases, documents like Web pages contain lists and tables that represent records consisting of similar fields. The present adaptive parser could read such tables and lists (assignment lists, news postings, search results) with minimal user direction.
Anomaly detection. If a large stream of textual data is presumed to have approximately regular structure and is received from an unreliable source, early detection of irregularities may be useful. An adaptive parser could train on known correct data and flag data that seems to be malformed.
Biological "parsing." Developing this idea would require talking to a biologist, but genetic sequences may contain regular structures that could be discovered automatically.
Support for data mining. In large-scale machine-learning endeavors, large stores of human data like the Web are often used for training. Adaptive parsers could separate human-generated content from machine-generated structure and presentation in such data pools.

Competitive learning for Python

The current AP prototype uses the competitive learning (CL) algorithm to cluster documents' n-grams. The git repository above contains the code for a generic competitive learning package for Python. The package is called cl and implements a convenient interface to vanilla CL, Frequency-Sensitive Competitive Learning (FSCL), and Rival-Penalized Competitive Learning (RPCL).
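For readers unfamiliar with the technique, the following is a minimal sketch of winner-take-all competitive learning on 2D points, with an optional frequency-sensitive variant. It is not the cl package's interface: the function names, parameters, and defaults are assumptions chosen for illustration, RPCL is not shown, and the sketch uses plain Python 3 lists rather than NumPy to stay dependency-free.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def competitive_learning(stimuli, neurons, rate=0.1, epochs=20,
                         frequency_sensitive=False, seed=0):
    """Move the winning neuron a fraction `rate` toward each stimulus.

    With frequency_sensitive=True, each neuron's distance is scaled by
    its win count (the FSCL idea), so frequent winners are handicapped.
    """
    rng = random.Random(seed)
    neurons = [list(n) for n in neurons]
    wins = [1] * len(neurons)
    stimuli = list(stimuli)
    for _ in range(epochs):
        rng.shuffle(stimuli)
        for s in stimuli:
            scores = []
            for i, neuron in enumerate(neurons):
                d = squared_distance(neuron, s)
                scores.append(d * wins[i] if frequency_sensitive else d)
            winner = scores.index(min(scores))
            wins[winner] += 1
            neurons[winner] = [w + rate * (x - w)
                               for w, x in zip(neurons[winner], s)]
    return neurons


# Two small, well-separated clusters (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# Reasonable starting positions: each neuron settles inside one cluster.
trained = competitive_learning(points, neurons=[(4, 4), (6, 6)])
print(trained)

# A badly placed neuron never wins under vanilla CL and so never moves.
stuck = competitive_learning(points, neurons=[(0, 0), (100, 100)])
print(stuck[1])  # [100, 100]

# The frequency-sensitive variant gives such a neuron a chance to win.
rebalanced = competitive_learning(points, neurons=[(0, 0), (100, 100)],
                                  frequency_sensitive=True)
print(rebalanced)
```

The second run shows the "dead neuron" problem that motivates variants such as FSCL: under the plain rule, a neuron that starts far from all the data is never selected and therefore never learns.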

The package should work with any Python type used as neurons and stimuli, supports custom distance and learning functions, and is easily extendable. It includes code for using the module with Euclidean spaces, sequences in general, and strings in particular. Support is also included for using the CL algorithms for clustering.
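As a rough illustration of what a pluggable distance function makes possible, the sketch below picks the winning neuron for string stimuli using Hamming distance over a document's n-grams. The names `ngrams`, `hamming`, and `nearest`, and the sample text and prototypes, are hypothetical and not taken from the cl package. Only winner selection is shown; a learning rule for moving a string neuron toward a stimulus is omitted.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return every contiguous substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(neurons, stimulus, distance):
    """Return the neuron closest to the stimulus under `distance`."""
    return min(neurons, key=lambda neuron: distance(neuron, stimulus))


# Group a sample's 2-grams by their nearest prototype (made-up values).
sample = "a,b;c,d;"
prototypes = [",,", ";;"]
clusters = defaultdict(list)
for gram in ngrams(sample, 2):
    clusters[nearest(prototypes, gram, hamming)].append(gram)
print(dict(clusters))
```

Because `nearest` takes the distance as an argument, swapping Hamming distance for another measure requires no change to the selection code.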

The module is reasonably well-documented and -commented. Its only dependency is NumPy. To try it, clone the git repository and look in the code/cl directory. A formal release is forthcoming when the AP prototype is finished.

Author

This research is part of an ongoing project by Adrian Sampson. It was started in December 2007 for a class taught by Professor Robert Keller of the Computer Science department at Harvey Mudd College.