
Approach

Since we were not knowledgeable enough about the strategy and tactics of the game to adequately evaluate moves ourselves, we decided to use a learning method that requires no expert-provided evaluations. Our method is based on the TD(lambda) algorithm used in Gerald Tesauro's TD-Gammon network (see references).

The basic idea of our approach was that the network's evaluations should be constant throughout a game. This ideal behavior is certainly not possible since the network does not have complete knowledge of the game state and, therefore, cannot determine absolutely if it will win or lose. It should be possible, however, to produce consistent evaluations where the probability of winning changes slowly throughout the game as more information becomes available in the form of more cards having been played.

Our learning algorithm attempts to achieve this temporal consistency by keeping a list of previous board positions for the current game. After every move, this history is used to train the network with a modified backprop method: the current evaluation serves as the target value for training on all of the past board positions, in an attempt to bring their values into agreement with the current evaluation. Backprop has been modified to include an additional lambda term which scales the learning rate, decaying exponentially as we train farther back into the history list. The idea is that more recent moves are more likely to have influenced the current evaluation and are therefore updated more, while moves made near the beginning of the game have less impact on the current position and so shouldn't be updated as much. To ensure that the evaluations move in the right direction, a final training target corresponding to either a win or a loss is given once the game is over. This helps to counteract situations where the network thinks it is winning throughout the game but is actually losing.
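
To make the procedure concrete, the sketch below shows one way the history-based update could be implemented in Python. The Network interface (evaluate and train_step), the hyperparameter values, and the function names are assumptions for illustration, not the project's actual code.

    ALPHA = 0.1    # base learning rate (assumed value)
    LAMBDA = 0.7   # per-step decay going back into the history (assumed value)

    def update_from_history(network, history, target):
        # Train every stored position toward `target`: the current evaluation,
        # or 1.0 / 0.0 once the game has ended in a win / loss. Positions
        # farther in the past receive an exponentially smaller learning rate.
        for steps_back, position in enumerate(reversed(history)):
            lr = ALPHA * (LAMBDA ** steps_back)   # recent moves are updated most
            network.train_step(position, target, lr)

    def after_move(network, history, new_position):
        # Called once per move: evaluate the new position, then pull the
        # stored history toward that evaluation.
        target = network.evaluate(new_position)
        update_from_history(network, history, target)
        history.append(new_position)

    def after_game(network, history, won):
        # Final correction at the end of the game: the true outcome replaces
        # the network's own evaluation as the training target.
        update_from_history(network, history, 1.0 if won else 0.0)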

Network Architecture

The network is a fully connected, single-layer network with sigmoid activation functions. There are 108 input units and one output unit. The board is represented using a localist representation: the first 52 inputs correspond to the cards in the deck and are set to one if the card is in the AI's hand and zero otherwise. The next 54 units represent the piles of cards on the board. Instead of representing every card that has already been played, only the end cards of the stacks (the ones that can be played on) are activated; the rest are set to zero. The value of the output unit represents the probability that a win will result from the current position.
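
As a rough illustration of this architecture, the sketch below implements a single layer of weights from the localist inputs to one sigmoid output. The unit counts follow the breakdown given above (52 hand units plus 54 pile units); the weight initialization, the card-to-index mapping, and the training rule are assumptions rather than the original implementation.

    import numpy as np

    N_HAND = 52    # one unit per card: 1 if the card is in the AI's hand
    N_PILES = 54   # one unit per pile-end card that can currently be played on
    N_INPUTS = N_HAND + N_PILES

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class EvalNet:
        def __init__(self, rng=None):
            rng = rng or np.random.default_rng(0)
            self.w = rng.normal(scale=0.1, size=N_INPUTS)
            self.b = 0.0

        def evaluate(self, x):
            # x is a 0/1 vector of length N_INPUTS; returns the P(win) estimate.
            return sigmoid(self.w @ x + self.b)

        def train_step(self, x, target, lr):
            # One gradient-descent step on squared error toward `target`.
            y = self.evaluate(x)
            grad = (y - target) * y * (1.0 - y)   # d(error)/d(pre-activation)
            self.w -= lr * grad * x
            self.b -= lr * grad

    def encode_position(hand_cards, pile_end_cards):
        # Localist encoding: activate the hand cards and the playable pile ends.
        # The card-to-index mapping is a placeholder assumption.
        x = np.zeros(N_INPUTS)
        for c in hand_cards:
            x[c] = 1.0               # c in 0..51
        for c in pile_end_cards:
            x[N_HAND + c] = 1.0      # c in 0..53
        return x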

Training Method

The network is trained by playing several thousand games against itself. During these games, both players use the same network, so that their combined experience improves the network faster. The randomness of the dealing process increases the portion of the game space that is explored, helping the network develop more general playing strategies rather than a strategy that, while self-consistent in self-play, would fail against the tactics employed by human players.
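
Putting the pieces together, a self-play training loop along these lines might look like the sketch below, which reuses the after_move and after_game helpers from the earlier sketch. The SevensGame simulator and its methods are hypothetical stand-ins for the actual game engine, and greedy move selection is an assumption about how moves were chosen.

    def choose_move(network, game, player):
        # Greedy selection: evaluate the position resulting from each legal
        # move and pick the one the network rates highest. We assume a "pass"
        # is included among the legal moves when nothing can be played.
        moves = game.legal_moves(player)
        scored = [(network.evaluate(game.position_after(move, player)), move)
                  for move in moves]
        return max(scored, key=lambda s: s[0])[1]

    def self_play(network, n_games=5000):
        for _ in range(n_games):
            game = SevensGame()                      # random deal each game
            histories = {p: [] for p in game.players}
            while not game.over():
                p = game.current_player()
                move = choose_move(network, game, p)
                game.play(move, p)
                after_move(network, histories[p], game.position_for(p))
            # Final win/loss signal for each player once the game is decided.
            for p in game.players:
                after_game(network, histories[p], won=(game.winner() == p))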

rmcknigh@cs.hmc.edu Sun Jan 15 09:41:10 2006