Adaptive parsing

Parsing is the transformation of flat text into data structures. Usually, this requires a syntax definition as input in addition to the text to be parsed. An adaptive parser (AP) performs the transformation with minimal additional input; in particular, it requires only syntactic information that a typical user can provide without the help of a programmer.

Current work

I am currently writing a prototype adaptive parser for extremely simple grammars that take the form of a list of similarly structured records. Records in the targeted grammars are separated by textual delimiters of some length, as are the fields within each record.
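To make the target data structure concrete, here is a minimal sketch of the transformation for one such list-of-records text. It assumes the record and field delimiters are already known, which is exactly the information an adaptive parser would have to work out with minimal user input; the function name parse_records and the sample text are hypothetical and are not taken from the prototype.

```python
def parse_records(text, record_delim, field_delim):
    """Split flat text into a list of records, each a list of field strings.

    Illustrative only: the delimiters are passed in explicitly here,
    whereas an adaptive parser would need to discover them.
    """
    records = []
    for chunk in text.split(record_delim):
        if chunk:  # skip empty chunks, e.g. after a trailing delimiter
            records.append(chunk.split(field_delim))
    return records


# Hypothetical sample: records end with a newline, fields are split by ", ".
sample = "alice, 30, Claremont\nbob, 25, Pomona\n"
parsed = parse_records(sample, record_delim="\n", field_delim=", ")
```

Both delimiters may be longer than one character, as with the two-character field delimiter in the sample.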

Relevant to this work:

The next steps in this project will be:

Examples

Status of the parser on various test cases and examples of the process are coming soon.

Applications

Some possible applications of an adaptive parser — even an extremely simple, list-of-records one like the current implementation — include:

Competitive learning for Python

The current AP prototype uses the competitive learning (CL) algorithm to cluster documents' n-grams. The git repository above contains the code for a generic competitive learning package for Python. The package is called cl and implements a convenient interface to vanilla CL, Frequency-Sensitive Competitive Learning (FSCL), and Rival-Penalized Competitive Learning (RPCL).
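For readers unfamiliar with these algorithms, the following is a rough sketch of the three variants on points in a Euclidean space, written directly against numpy. It is not the cl package's interface: the function name, parameters, and default values are all assumptions made for illustration. In vanilla CL the neuron nearest to each stimulus wins and moves toward it; in FSCL distances are scaled by each neuron's win count so that frequent winners are handicapped; in RPCL the runner-up (the rival) is additionally pushed slightly away from the stimulus.

```python
import numpy as np


def competitive_learning(stimuli, neurons, rate=0.1, rival_rate=0.0,
                         frequency_sensitive=False, epochs=10):
    """Illustrative competitive learning on Euclidean points.

    Vanilla CL: frequency_sensitive=False, rival_rate=0.
    FSCL:       frequency_sensitive=True,  rival_rate=0.
    RPCL:       frequency_sensitive=True,  rival_rate > 0 (small).
    Returns a new array of trained neurons; the inputs are not modified.
    """
    stimuli = np.asarray(stimuli, dtype=float)
    neurons = np.array(neurons, dtype=float)  # np.array copies its input
    wins = np.ones(len(neurons))
    for _ in range(epochs):
        for x in stimuli:
            dists = np.linalg.norm(neurons - x, axis=1)
            if frequency_sensitive:
                dists = dists * wins  # handicap neurons that win often
            order = np.argsort(dists)
            winner = order[0]
            # Pull the winning neuron toward the stimulus.
            neurons[winner] += rate * (x - neurons[winner])
            wins[winner] += 1
            if rival_rate > 0 and len(neurons) > 1:
                rival = order[1]
                # Push the runner-up away from the stimulus.
                neurons[rival] -= rival_rate * (x - neurons[rival])
    return neurons


# Two small clusters of made-up points and two starting neurons.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
start = [[1.0, 1.0], [9.0, 9.0]]
centers = competitive_learning(points, start)
```

After training, each neuron sits near the centre of one cluster, so assigning every stimulus to its nearest neuron yields a clustering.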

The package should work with any Python type used as neurons and stimuli, supports custom distance and learning functions, and is easily extensible. It includes code for using the module with Euclidean spaces, sequences in general, and strings in particular. Support is also included for using the CL algorithms for clustering.
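The idea of plugging in a custom distance and learning function can be sketched for string neurons as follows. Nothing here reflects the cl package's actual API: train, hamming, and nudge are hypothetical names, and the learning rule (copy the first differing character from the stimulus) is just one simple choice made for illustration. It assumes all strings have equal length.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def nudge(neuron, stimulus):
    """Move a string neuron one step toward the stimulus.

    Copies the first differing character from the stimulus; returns the
    neuron unchanged if the two strings already match.
    """
    for i, (cn, cs) in enumerate(zip(neuron, stimulus)):
        if cn != cs:
            return neuron[:i] + cs + neuron[i + 1:]
    return neuron


def train(neurons, stimuli, distance, learn):
    """Vanilla competitive learning over arbitrary Python objects."""
    neurons = list(neurons)  # work on a copy
    for stimulus in stimuli:
        # The neuron closest to the stimulus wins and is updated.
        winner = min(range(len(neurons)),
                     key=lambda i: distance(neurons[i], stimulus))
        neurons[winner] = learn(neurons[winner], stimulus)
    return neurons


trained = train(["aaa", "zzz"], ["abc", "xyz", "abc", "xyz"], hamming, nudge)
```

Because train only ever calls distance and learn, the same loop works for any neuron type for which those two functions are defined.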

The module is reasonably well documented and commented. Its only dependency is numpy. Clone the git repository and look in the code/cl directory. A formal release will follow once the AP prototype is finished.

Author

This research is part of an ongoing project by Adrian Sampson. It was started in December 2007 for a class taught by Professor Robert Keller of the Computer Science department at Harvey Mudd College.