Introduction

Temporal difference learning is a reinforcement learning technique, commonly paired with a neural network, for learning to predict an outcome that is initially uncertain. It is particularly well suited to games and has been used to build strong AI players for chess, Go, and checkers. Its most successful application has been to the game of backgammon, where it produced an AI rivaling, and arguably surpassing, the best human players.

I am looking to apply this technique to Havannah, a game created by Christian Freeling in 1979 and designed to be particularly resistant to computer techniques. In 2002, he made a public challenge, promising $1000 to any program that could beat him even once in a ten game match. Nobody has stepped up to claim this prize.

Havannah is played on a hexagonal grid of varying size. The preferred size for human players (and the size at which the challenge stands) is 10x10, but interesting games can be played on smaller boards. In Havannah, players take turns placing a stone of their color on the grid. Play continues until one player forms one of the following with a single connected chain of their stones: a bridge, a chain containing two corner hexes; a fork, a chain touching three separate sides without relying on a connection through a corner hex; or a ring, a chain completely surrounding at least one hex.

Figure 1. Examples of the three win conditions in Havannah.
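
As a sketch of how the first two conditions can be detected, the following Java code flood-fills a single chain and counts the distinct corners and edges it touches. The axial-coordinate board representation is only an illustrative assumption, and the ring condition is omitted since it requires a separate surrounded-region test.

    import java.util.*;

    // Sketch of bridge and fork detection by flood fill over one player's
    // stones, using axial coordinates (q, r).
    public class WinCheck {
        final int side;                           // hexes per board side
        final Set<Long> stones = new HashSet<>(); // this player's stones

        WinCheck(int side) { this.side = side; }

        static long key(int q, int r) { return ((long) q << 32) ^ (r & 0xffffffffL); }

        void place(int q, int r) { stones.add(key(q, r)); }

        boolean onBoard(int q, int r) {
            int s = -q - r; // third cube coordinate
            return Math.max(Math.abs(q), Math.max(Math.abs(r), Math.abs(s))) <= side - 1;
        }

        // A corner cell has two cube coordinates at the extreme magnitude.
        boolean isCorner(int q, int r) {
            int s = -q - r, m = side - 1, extremes = 0;
            if (Math.abs(q) == m) extremes++;
            if (Math.abs(r) == m) extremes++;
            if (Math.abs(s) == m) extremes++;
            return extremes == 2;
        }

        // Edge id 0-5 for border non-corner cells, -1 otherwise; corners
        // deliberately return -1, since they do not count toward a fork.
        int edgeId(int q, int r) {
            if (isCorner(q, r)) return -1;
            int s = -q - r, m = side - 1;
            if (q == m)  return 0;
            if (q == -m) return 1;
            if (r == m)  return 2;
            if (r == -m) return 3;
            if (s == m)  return 4;
            if (s == -m) return 5;
            return -1; // interior cell
        }

        static final int[][] DIRS = {{1,0},{-1,0},{0,1},{0,-1},{1,-1},{-1,1}};

        // Flood-fill the chain containing (q0, r0): a bridge needs two
        // corners, a fork needs three distinct edges.
        boolean chainWins(int q0, int r0) {
            Deque<int[]> stack = new ArrayDeque<>();
            Set<Long> seen = new HashSet<>();
            Set<Integer> edges = new HashSet<>();
            int corners = 0;
            stack.push(new int[]{q0, r0});
            seen.add(key(q0, r0));
            while (!stack.isEmpty()) {
                int[] c = stack.pop();
                if (isCorner(c[0], c[1])) corners++;
                int e = edgeId(c[0], c[1]);
                if (e >= 0) edges.add(e);
                if (corners >= 2 || edges.size() >= 3) return true;
                for (int[] d : DIRS) {
                    int nq = c[0] + d[0], nr = c[1] + d[1];
                    if (onBoard(nq, nr) && stones.contains(key(nq, nr))
                            && seen.add(key(nq, nr)))
                        stack.push(new int[]{nq, nr});
                }
            }
            return false;
        }
    }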

The complex interplay between these varied winning conditions has proven very difficult for computers to handle. All existing programs are quite weak; the strongest generally use Monte Carlo tree search.

Previous Work

Temporal Difference Learning and TD-Gammon, Tesauro
Paper outlining the most famous application of TD learning: the game of backgammon. The technique is laid out, and its performance in creating a world-class backgammon opponent is discussed.

Temporal Difference Learning of Position Evaluation in the Game of Go, Schraudolph et al.
Interesting comments on leveraging symmetries to speed training, along with some insight into learning in deterministic games.

Creating a Havannah Playing Agent, Joosten
Insight into the properties of Havannah that make it difficult for computers, along with a description of one of the stronger bots currently available.

Network Structure

The neural network I used for all of my experiments was a multilayer perceptron with an (n+1)-150-1 structure, where n is the number of hexes on the game board. An output of 1 represents a win for the first player, and an output of -1 represents a win for the second player. Each input is 1 if the corresponding hex is occupied by the first player, 0 if it is unoccupied, and -1 if it is occupied by the second player.
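
A simplified Java sketch of this network's forward pass follows. The tanh activations (which keep the output between -1 and 1) and the treatment of the extra (n+1)-th input as a constant bias are illustrative assumptions, not a precise description of my implementation.

    import java.util.Random;

    // Sketch of the (n+1)-150-1 forward pass with small random weights.
    public class ValueNet {
        final int n;              // number of hexes on the board
        final double[][] w1;      // (n+1) x 150 input-to-hidden weights
        final double[] w2;        // 150 hidden-to-output weights
        final double[] hidden = new double[150];

        public ValueNet(int n) {
            this.n = n;
            Random rng = new Random();
            w1 = new double[n + 1][150];
            w2 = new double[150];
            for (double[] row : w1)
                for (int j = 0; j < 150; j++) row[j] = rng.nextGaussian() * 0.01;
            for (int j = 0; j < 150; j++) w2[j] = rng.nextGaussian() * 0.01;
        }

        // board[i] is +1 (first player), 0 (empty), or -1 (second player).
        public double evaluate(double[] board) {
            double[] x = new double[n + 1];
            System.arraycopy(board, 0, x, 0, n);
            x[n] = 1.0; // extra input treated as a constant bias (assumption)
            double out = 0;
            for (int j = 0; j < 150; j++) {
                double sum = 0;
                for (int i = 0; i <= n; i++) sum += w1[i][j] * x[i];
                hidden[j] = Math.tanh(sum);
                out += w2[j] * hidden[j];
            }
            return Math.tanh(out); // in (-1, 1): +1 means a first-player win
        }
    }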

Temporal Difference Learning and Training

The training method used for this project went through several iterations. The basic format of the training was self-play: agents took turns playing games against each other and performed a temporal difference update after each move. This involves evaluating all potential next moves, finding the strongest one, and backpropagating its value through the network as the target for the current position, scaled by a weighting coefficient (I chose 0.99).
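
Concretely, one such update looks roughly like the following Java sketch. The Net interface, and in particular its trainToward method (assumed to take a single backprop step toward a target value), is a hypothetical stand-in for my actual network code.

    import java.util.List;

    // Hypothetical interface to the network: evaluate() scores a position,
    // trainToward() takes one backprop step toward a target value.
    interface Net {
        double evaluate(double[] position);
        void trainToward(double[] position, double target);
    }

    class TdTrainer {
        // Evaluate every legal successor position, then move the current
        // position's value toward the discounted best successor value
        // (discount = 0.99 in my experiments).
        static void tdStep(Net net, double[] current, List<double[]> successors,
                           boolean firstPlayerToMove, double discount) {
            double best = firstPlayerToMove ? Double.NEGATIVE_INFINITY
                                            : Double.POSITIVE_INFINITY;
            for (double[] s : successors) {
                double v = net.evaluate(s);
                best = firstPlayerToMove ? Math.max(best, v) : Math.min(best, v);
            }
            net.trainToward(current, discount * best);
        }
    }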

The issue is that Havannah is a deterministic game: if agents always make the strongest move, they settle into a rut and fail to explore large portions of the game tree. Thus, while the agents need to be trained on the best available move, they also need to make some suboptimal moves in order to explore further. First, I tried having the agents play entirely random moves, but the state space of Havannah is too large for this to be effective; although we want to explore more of the state space, it is more efficient to concentrate on areas likely to contain good move sequences. Next, I tried playing random moves a fixed percentage of the time (sketched in the code below). If the percentage is fairly low (<20%), this technique can solve 2x2 Havannah in around 10,000 iterations, and it beats a random opponent on a 3x3 board 98% of the time. Another technique I tried was choosing random moves based on the evaluation of the best available move: if the best move is evaluated at 0.95, I am less likely to want to try a random move than if the best is evaluated at 0.1. This technique achieved a virtually 100% win rate against a random player on a 3x3 board, yet after 200,000 games of training it still lost to any decent human player. These bots were generally able to play three moves in a row along a side, but could not adapt if this initial win attempt was blocked.
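
As a sketch, the fixed-percentage rule amounts to the following (names and signatures are illustrative):

    import java.util.Random;

    // With probability epsilon play a uniformly random move, otherwise
    // play the move with the best evaluation for the player to move.
    class EpsilonGreedy {
        static int chooseMove(double[] moveValues, double epsilon, Random rng,
                              boolean firstPlayerToMove) {
            if (rng.nextDouble() < epsilon)
                return rng.nextInt(moveValues.length); // explore
            int best = 0;
            for (int i = 1; i < moveValues.length; i++) {
                boolean better = firstPlayerToMove
                        ? moveValues[i] > moveValues[best]
                        : moveValues[i] < moveValues[best];
                if (better) best = i;
            }
            return best; // exploit
        }
    }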

The last training technique I tried was choosing moves according to a Gibbs distribution: the probability of selecting each move is proportional to the exponential of its evaluation by the network. This technique was far more effective than any other. While the earlier techniques created opponents that could mount a basic initial threat, these AIs learned to make multiple threats in a row. Additionally, the bot follows good strategy by grabbing corner hexes. One weakness of this bot, however, is that it was unable to recognize win threats by its opponent. After about 40,000 iterations, it was still straightforward for a human player to win on a 3x3 board, though the net is able to force wins against some human second moves.
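
A sketch of this selection rule is below. The temperature parameter T is an illustrative generalization (T = 1 recovers probabilities proportional to the raw exponentials described above):

    import java.util.Random;

    // Choose a move with probability proportional to exp(value / T).
    // For the second player, negate moveValues first so that moves good
    // for that player receive high weight.
    class GibbsSelect {
        static int chooseMove(double[] moveValues, double temperature, Random rng) {
            double[] weights = new double[moveValues.length];
            double total = 0;
            for (int i = 0; i < moveValues.length; i++) {
                weights[i] = Math.exp(moveValues[i] / temperature);
                total += weights[i];
            }
            double r = rng.nextDouble() * total; // sample from the distribution
            for (int i = 0; i < weights.length; i++) {
                r -= weights[i];
                if (r <= 0) return i;
            }
            return weights.length - 1; // numerical fallback
        }
    }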

Conclusions and Future Work

Overall, this technique shows promise: after 40,000 games the AI had learned many key strategic features of the game on a 3x3 board. However, applying this method on a 10x10 board would likely be too slow, and it would be difficult to create a strong bot through self-play alone. It is clear that for TD learning to work on a deterministic game, state space exploration is critical; performance increased by a large amount once Gibbs sampling was implemented.

In the future, there are a number of aspects that could potentially improve the learning of this neural net. First, symmetries could be exploited to create more training data. For example, the hexagonal board shape has a 6-fold symmetry that could create 6 training points instead of 1 for every move. I have already done this for color symmetry (the outcome should be opposite if the colors are reversed; see the sketch below). Also, my choice of network structure was fairly arbitrary, and the learning process could probably be improved if more care were taken in selecting an appropriate structure. Finally, the size of game I could attempt was limited by the efficiency of my code. It would be interesting to optimize the code and see what strategic features emerge if training is done on a 10x10 board.
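
As a simplified sketch, the color-swap augmentation amounts to negating every board input and the target outcome:

    // Negating every input and the target yields a second, equally valid
    // training example, since a win for one color is a loss for the other.
    class Augment {
        // position uses the +1/0/-1 encoding described above
        static double[] colorSwappedPosition(double[] position) {
            double[] flipped = new double[position.length];
            for (int i = 0; i < position.length; i++)
                flipped[i] = -position[i]; // swap the two players' stones
            return flipped;
        }

        static double colorSwappedTarget(double target) {
            return -target; // the predicted outcome reverses as well
        }
    }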

Play a game against the net!

If you download the code, you can play against the bot by running HavannahApplet.class. Further training can be done by running Game.class.