Long Short-Term Memory

CS152, Prof. Keller

David R. Morrison

12 Dec. 2006


Project Description:

My final project for this class was to build a neural network using the Long Short-Term Memory architecture (with forget gates) described by Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins in the following two papers:

LSTM is a recurrent network architecture that uses memory cells which learn to store and release information at the appropriate time. This is accomplished by using gates to control the flow of information and error through the cells. For learning, it uses a method similar to truncated Back-Propagation Through Time (BPTT).
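
As a concrete sketch of the mechanism (not my actual class code), a single memory cell with a forget gate can be written roughly as follows, with logistic sigmoid gates and tanh squashing functions:

    #include <cmath>
    #include <cstdio>

    static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    struct MemoryCell {
        double state = 0.0;   // the internal state (the "constant error carousel")

        // netIn and the three gate nets are the weighted sums of the cell's
        // inputs; computing those sums is omitted here
        double forward(double netIn, double netInGate,
                       double netForgetGate, double netOutGate) {
            double inGate     = sigmoid(netInGate);      // how much new input to store
            double forgetGate = sigmoid(netForgetGate);  // how much old state to keep
            double outGate    = sigmoid(netOutGate);     // how much state to release

            state = forgetGate * state + inGate * std::tanh(netIn);
            return outGate * std::tanh(state);           // cell output
        }
    };

    int main() {
        MemoryCell cell;
        for (int t = 0; t < 3; ++t) {                    // watch the state accumulate
            double out = cell.forward(1.0, 2.0, 2.0, 2.0);
            std::printf("t=%d out=%f state=%f\n", t, out, cell.state);
        }
        return 0;
    }

Each gate scales a quantity between 0 and 1: the input gate decides how much of the new input to store, the forget gate decides how much of the old state to keep, and the output gate decides how much of the state to release to the rest of the network.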

Presentation:

Progress:

The primary aspect of this project was simply building the network itself, which took a fairly large amount of time. The code is written in C++, and broken up into a variety of different classes, detailed below:
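
Roughly, a Network contains one or more CLayers, each CLayer holds memory Blocks, and each Block holds internal Cells. The sketch below shows one plausible way that nesting could be expressed; the member names are placeholders rather than the actual declarations, which also hold weights, gate activations, and the state needed for training:

    #include <vector>

    class Cell {            // a single memory cell with its internal state
    public:
        double state = 0.0;
        double output = 0.0;
    };

    class Block {           // a memory block: Cells sharing one set of gates
    public:
        std::vector<Cell> cells;
        double inputGate = 0.0, forgetGate = 0.0, outputGate = 0.0;
    };

    class CLayer {          // a cell layer: a collection of memory Blocks
    public:
        std::vector<Block> blocks;
    };

    class Network {         // the whole network: inputs, cell layers, outputs
    public:
        std::vector<double> inputs;
        std::vector<CLayer> layers;
        std::vector<double> outputs;
    };

    int main() {
        // e.g. the configuration used in the first test below: one CLayer
        // containing one Block with one Cell, two inputs, one output
        Network net;
        net.inputs.resize(2);
        net.outputs.resize(1);
        net.layers.resize(1);
        net.layers[0].blocks.resize(1);
        net.layers[0].blocks[0].cells.resize(1);
        return 0;
    }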

The first step, after coding, was to do a simple test to make sure the network was behaving properly. I wrote a test file that created a Network with one CLayer, one Block, and one Cell, with two inputs and a single output. The network's goal was to always output 2.0. After a significant amount of debugging I got this to work; it turned out that the network completely squashed whatever input it was given and trained the output neuron's bias weight to 2.
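
That effect is easy to reproduce in isolation: once the cell output is squashed to (nearly) zero, a linear output neuron trained by gradient descent simply pushes its bias toward the constant target. The toy reproduction below (not my network code; the learning rate and weight values are arbitrary) shows the bias converging to 2:

    #include <cstdio>
    #include <cstdlib>

    int main() {
        double w = 0.1, bias = 0.0;        // output neuron weight and bias
        const double lr = 0.1, target = 2.0;

        for (int epoch = 0; epoch < 200; ++epoch) {
            // the memory cell's output after squashing: effectively zero,
            // whatever the original input was
            double squashed = 1e-4 * (std::rand() / (double)RAND_MAX);
            double out = w * squashed + bias;    // linear output neuron
            double err = target - out;
            bias += lr * err;                    // the bias absorbs the constant target
            w    += lr * err * squashed;         // essentially no gradient reaches w
        }
        std::printf("bias after training: %f\n", bias);   // ~2.0
        return 0;
    }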

The next test was to train the network to learn the Embedded Reber Grammar, a standard benchmark for seeing how well a recurrent network works; here I ran into some problems. I encoded the 7 characters of the Reber grammar with a one-hot code, and the network's task was, given an input character, to predict what the next character (or characters) could be. For this experiment I used a single layer of two Blocks, each with two internal Cells. There were seven inputs and seven outputs, and the network's goal was to learn to output a 1 for each possible letter that might come next in the sequence.

First I tried this with a single string (“BPBPTTVVEPE”); for this simple test case, the network learned to correctly predict the next letter in the sequence in approximately 10652 epochs (reaching an MSE of 0.0000999 with a learning rate of 0.1). Next, I trained it on a set of 1000 randomly generated strings in the Reber grammar and then ran the trained network on a test set of another 1000 strings. Training to an MSE of 0.0001 would have taken a very long time, but the network learned to correctly predict the output (using a hardlim transfer function on the output neurons) after only a few hundred epochs.
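
For reference, the embedded Reber grammar is small enough to write out as a transition table. The generator below is not the one I used to build the training and test sets, but it produces strings from the same grammar and shows the one-hot symbol indexing (the alphabet ordering B T P S X V E is just an example choice):

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <string>

    // Inner Reber grammar as a transition table: from each state there are two
    // possible (symbol, next-state) choices; state 5 is the exit node.
    static std::string reberString() {
        static const char sym[5][2]  = { {'T','P'}, {'S','X'}, {'T','V'},
                                         {'X','S'}, {'P','V'} };
        static const int  next[5][2] = { {1,2}, {1,3}, {2,4}, {2,5}, {3,5} };
        std::string s = "B";
        int state = 0;
        while (state != 5) {
            int choice = std::rand() % 2;
            s += sym[state][choice];
            state = next[state][choice];
        }
        return s + "E";
    }

    // Embedded Reber grammar: B, then T or P, then an inner Reber string,
    // then the same T/P symbol again, then E (e.g. "BPBPTTVVEPE").
    static std::string embeddedReberString() {
        char branch = (std::rand() % 2) ? 'T' : 'P';
        return std::string("B") + branch + reberString() + branch + "E";
    }

    // One-hot index of each of the seven symbols (ordering is arbitrary).
    static int symbolIndex(char c) {
        const std::string alphabet = "BTPSXVE";
        return static_cast<int>(alphabet.find(c));
    }

    int main() {
        std::srand(static_cast<unsigned>(std::time(nullptr)));
        std::string s = embeddedReberString();
        std::cout << s << "\n";
        // The network sees the string one symbol at a time as one-hot vectors;
        // the target at step t has a 1 for every symbol the grammar allows at t+1.
        for (char c : s)
            std::cout << symbolIndex(c) << " ";
        std::cout << "\n";
        return 0;
    }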

Code: