Inspired by the success of game-playing programs such as TD-Gammon, this project aimed to teach a feedforward neural network to play the board game Pentago, a game that relies heavily on pattern recognition. The networks took the raw board state as input and evaluated the strength of each player's position. All networks were feedforward multi-layer perceptrons with one hidden layer, trained using the TD(lambda) algorithm. Training was done by playing against existing AI implementations as well as against other networks. Learning did occur, but appeared to be limited, likely due to the constraint of a single hidden layer as well as flaws in the training design.
The Game
A game of the "seconds to learn, a lifetime to master" genre, Pentago is a variation of five-in-a-row that requires complex pattern recognition. It is played on a 6x6 grid split into four 3x3 quadrants. On each turn, a player first places a piece of their color and then rotates any of the four quadrants 90 degrees. The first player to get 5 in a row wins. The rotation element makes the game significantly more difficult, and it takes a solid amount of practice before a human player becomes proficient. The most essential skill in the game is recognizing an opponent's potential to win on the next move. I have easily beaten every AI bot I have played against in 5 moves (the minimum), so the goal was to train a neural network that could last more than 5 moves. Also, check out some of the awards the game has won for confirmation of its awesomeness!
Design Considerations
The network and player (called Ann, of course) were designed both to take advantage of certain aspects of the game and to improve performance. The following lists several features of and observations about the game, and how they informed Ann's design.
Symmetry
The board has reflectional and rotational symmetry: rotating or reflecting a board state produces a number of equivalent states that should all be evaluated identically by the network. To take advantage of this, the board is always oriented so that the quadrant with the most pieces is in the upper left corner before being presented to the network. Furthermore, there is player symmetry: if you switch the colors of all the pieces in a board state, the value of the position for each player should switch as well. By choosing the network inputs to be -1 and 1, for white and black respectively, and using a hyperbolic tangent activation function with no bias or threshold, this property always holds, as tanh is an odd function.
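Because negation passes through every layer of a bias-free tanh network, the color-swap symmetry can be checked directly. A minimal sketch, with arbitrary hypothetical layer sizes (36 inputs, 36 hidden units, 2 outputs):

    import numpy as np

    def evaluate(board, W1, W2):
        # Bias-free MLP: tanh hidden layer, tanh outputs.
        return np.tanh(W2 @ np.tanh(W1 @ board))

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(36, 36)), rng.normal(size=(2, 36))
    board = rng.choice([-1.0, 0.0, 1.0], size=36)  # -1 white, 1 black, 0 empty

    # Swapping colors negates every input; since tanh is odd and there are
    # no biases, the outputs negate as well.
    assert np.allclose(evaluate(-board, W1, W2), -evaluate(board, W1, W2))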
Deterministic
One of the fundamental challenges for machine learning approaches to board games is the problem of exploitation versus exploration. After some training time, the network will have learned a number of moves that it considers good. If trained properly, choosing these moves will improve its performance, but at the cost of exploring new, potentially fruitful board states. TD-Gammon's success has been partially attributed to the fact that the dice rolls add a stochastic element to gameplay that makes exploration intrinsic to the game. We have no such luck with Pentago. To add an exploration element, each turn all of the possible board states are evaluated by the network, producing a value for each move. The move is then picked probabilistically, weighting each move by e^v, where v is the value of the move. This leads to good moves being chosen more frequently, but allows exploration of the state space to occur as well.
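This selection rule is a softmax over move values. A minimal sketch, assuming the network has already produced a value for each candidate move (subtracting the max before exponentiating is only a numerical-stability trick; the proportions are unchanged):

    import numpy as np

    def pick_move(candidate_values, rng):
        # Choose a move index with probability proportional to e^v.
        v = np.asarray(candidate_values, dtype=float)
        weights = np.exp(v - v.max())
        return rng.choice(len(v), p=weights / weights.sum())

    rng = np.random.default_rng(0)
    values = [0.8, 0.1, -0.5]      # hypothetical evaluations of three legal moves
    move = pick_move(values, rng)  # the first move is chosen most often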
Number of Ways to Win
There are 32 ways to get 5 in a row on the board: 12 horizontal, 12 vertical, and 8 diagonal. It follows that the network would need at least that many nodes in the hidden layer (though likely more) in order to effectively recognize the patterns necessary to play the game well.
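The count is easy to verify directly:

    def count_win_lines(n=6, k=5):
        lines = 2 * n * (n - k + 1)    # horizontal and vertical
        lines += 2 * (n - k + 1) ** 2  # both diagonal directions
        return lines

    print(count_win_lines())  # 32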
Short Games
A Pentago game is short relative to other games (backgammon in particular), and ties rarely occur. This should translate into shorter training times than the same approach would need for longer games.
Rotations
Given a board, a player needs to first determine whether or not they must block their opponent from winning on the next turn. This usually requires rotating the different quadrants in their head and then evaluating how close their opponent is to winning. As only simple 1-ply lookahead was implemented for this project, the network must learn not just to recognize the threat of 4 pieces in a row, but also to internally rotate the board and evaluate the strength of the other player's position. Performing rotations in an MLP is certainly feasible, as a rotation is simply a permutation of a set of spaces on the board, but the ability to rotate and then reevaluate likely requires multiple hidden layers. However, only one hidden layer was used in this project as a starting point.
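For illustration, here is a quadrant rotation expressed as an index manipulation on a flat 36-element board; the quadrant numbering and clockwise convention are arbitrary choices for this sketch:

    import numpy as np

    def rotate_quadrant(board, quadrant, clockwise=True):
        # Rotate one 3x3 quadrant of a flat 6x6 board by 90 degrees.
        b = board.reshape(6, 6).copy()
        r, c = 3 * (quadrant // 2), 3 * (quadrant % 2)  # quadrants numbered 0..3
        b[r:r + 3, c:c + 3] = np.rot90(b[r:r + 3, c:c + 3], -1 if clockwise else 1)
        return b.reshape(36)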
Training
Training was done using the TD(lambda) algorithm. This is the same algorithm used by TD-Gammon, and it has been applied to a variety of other games with varying success. As with all temporal difference methods, the network is updated each turn, with the difference between subsequent evaluations of the board serving as the error to be backpropagated. A final win or loss signal is used at the end of the game. The value of the lambda parameter determines how previous patterns presented to the network impact weight updates at the current time step, and can be interpreted as the decay rate of its memory. A lambda value of zero effectively trains the network to minimize the difference between subsequent evaluations. On the other end of the spectrum, a value of one is equivalent to associating all of the patterns seen in the game with the final outcome.
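A sketch of the resulting weight update for a single weight vector, using an eligibility trace as the decaying memory described above; the gradient of the network's output with respect to the weights is assumed to come from backpropagation elsewhere:

    def td_lambda_update(weights, traces, grad, value, next_value, alpha, lam):
        # One TD(lambda) step. next_value is the evaluation of the following
        # board state, or the win/loss signal at the end of the game.
        delta = next_value - value    # temporal-difference error
        traces = lam * traces + grad  # lam=0: only the current state matters;
                                      # lam=1: all states seen so far share credit
        weights = weights + alpha * delta * traces
        return weights, traces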
Networks were trained in two ways. First, networks with varying parameters were trained against rule-based AI bots. The hope was that by using a somewhat skilled player to train against, the games would be short, and the network could quickly learn to associate the appropriate patterns with winning and losing. Second, neural networks with different parameters were trained against each other, in the spirit of TD-Gammon.
Results and Discussion
There were two outputs of the network, one for white and one for black. A win for black sent a final signal of [-1, 1], and a win for white sent [1, -1]. We define the network's error on a particular game to be the sum of the absolute differences between its evaluation of the final board state and the outcome signal. Thus, a random player (or newly initialized network) would have an error of 2, and the error can range from 0 to 4. Network performance was determined by calculating the average error over the last 100 games.
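In code, the error metric for a single game looks like this:

    import numpy as np

    def game_error(final_output, winner):
        # Sum of absolute differences between the network's evaluation of the
        # final board and the true outcome signal.
        target = np.array([1.0, -1.0]) if winner == "white" else np.array([-1.0, 1.0])
        return np.abs(final_output - target).sum()

    game_error(np.array([0.0, 0.0]), "white")   # 2.0: an untrained network
    game_error(np.array([-1.0, 1.0]), "white")  # 4.0: maximally wrong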
Training Against Bots
Playing against AI players, the neural network quickly converged (within several hundred games) to low error values (near 0) on most games, with an error of ~4 on the remainder. However, Ann was still unable to win many games, and when it did win, the win was often unexpected (an error of 4). A closer examination of the bots revealed a flaw in this method of training: the bots were deterministic. Because they were deterministic, they played the same moves every time, unless their strategy was perturbed by the network's random placement of pieces. Over time, the network learned to recognize which player was the bot and simply predicted that it would win. This approach to training could still work, but a more random bot needs to be used.
Training Against Nets
Neural nets with different parameters were also trained against each other for ~20,000 games. Generally, they were able to reduce their prediction error to ~1.7-1.9 and could beat a random player 55-70% of the time, depending on the network. The problem here is that, since the networks played as effectively random players at first, the games they trained on were long, with the board nearly or completely full when the game ended. This is problematic because most of the pieces on the board are not significant to training (only the 5 used to win matter), and so they can be interpreted as noise.
Impact of Lambda
In previous uses of TD(lambda), no asymptotic performance difference was seen, but higher lambda values were generally associated with faster training times. While only lambda values of 0.3 and 0.7 were tested here, no significant difference in performance was noted. The signs of the outputs of networks with different lambda values were usually the same, but the magnitudes of the outputs were greater for the higher value of lambda.
Impact of Learning Rate
Training turned out to be quite sensitive to the learning rate. Learning rates of 0.1 and higher caused the network to get easily caught in local minima. In these minima, the network would evaluate any given board state in a sort of binary manner, calling it a win or a loss regardless of the number of pieces on the board. This is clearly unreasonable. Better success was achieved with lower learning rates, but the error seemed to be bounded below by ~1.7, and no amount of training seemed to reduce it below that.
Impact of Hidden Units
Networks with 36, 72, and 144 hidden units were trained. Those with more hidden units generally performed better in terms of both average error and games won against a random player. No significant difference in training time was noted. I hypothesize that the restriction to one hidden layer is what kept the networks from achieving average errors below ~1.7.
Future Work
The following features would likely improve performance, and may be implemented in the near future.
2-Ply Lookahead
In 2-ply lookahead, the network first evaluates all of its possible moves and then all of the opponent's possible replies, finally selecting the move that it believes will benefit it the most. This is a very attractive option because the rotation of quadrants does not need to be done internally by the network; it is presented with all of the possible rotational states and simply asked to evaluate each board. I suspect this would dramatically improve the results.
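A sketch of what this would look like, with hypothetical legal_moves, apply_move, and evaluate helpers passed in; players are encoded as +1/-1:

    def two_ply_move(board, player, legal_moves, apply_move, evaluate):
        # Pick the move whose value after the opponent's best reply is highest
        # (a minimax search two plies deep).
        best_move, best_value = None, float("-inf")
        for move in legal_moves(board, player):
            after_us = apply_move(board, move, player)
            # The opponent is assumed to pick the reply that is worst for us.
            replies = [evaluate(apply_move(after_us, m, -player), player)
                       for m in legal_moves(after_us, -player)]
            value = min(replies) if replies else evaluate(after_us, player)
            if value > best_value:
                best_move, best_value = move, value
        return best_move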
Multiple Hidden Layers
As stated above, successfully playing the game requires the ability to internally rotate quadrants of the board (assuming 1-ply lookahead). This would likely require more hidden layers than the single one used in this implementation.
Different Training Approach
It is probably more efficient to first train the network on boards containing a 5-in-a-row, so that it learns to associate those patterns with a win. This would warm-start learning, after which the network could be trained by other means. Additionally, an AI opponent with more randomness built in may lead to better performance.
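One way such warm-start positions might be generated; the direction encoding and the use of a bare winning line (no other pieces) are arbitrary choices for this sketch:

    import numpy as np

    def winning_board(rng, player=1):
        # A 6x6 board containing only a single 5-in-a-row for `player`,
        # to be paired with the corresponding win signal during training.
        board = np.zeros((6, 6))
        direction = rng.integers(0, 4)
        if direction == 0:    # horizontal
            r, c, dr, dc = rng.integers(0, 6), rng.integers(0, 2), 0, 1
        elif direction == 1:  # vertical
            r, c, dr, dc = rng.integers(0, 2), rng.integers(0, 6), 1, 0
        elif direction == 2:  # down-right diagonal
            r, c, dr, dc = rng.integers(0, 2), rng.integers(0, 2), 1, 1
        else:                 # down-left diagonal
            r, c, dr, dc = rng.integers(0, 2), rng.integers(4, 6), 1, -1
        for i in range(5):
            board[r + i * dr, c + i * dc] = player
        return board.reshape(36)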
Learn On Smaller Board First
As is standard in AI approaches to board games, it would be useful to first design and train a network to play on a 4x4 board with 3 in a row to win. This would greatly reduce training time and allow more tweaks and features to be tested before they are implemented on the full board.
Source Code.