Reinforcement Learning applied to Othello

 

CS 152 Neural Networks

George Tucker

 

 

For our term project, we were allowed to pursue a topic of interest.  When learning about temporal difference learning, particularly TD-Gammon, I was amazed that the network learned without supervision, simply by playing against itself.  I decided to pursue temporal difference learning applied to Othello.  Othello is played on an 8×8 board of 64 squares with black and white discs.  The object of the game is to control a majority of the discs at the end of the game by flipping over as many of your opponent's discs as possible.  It is not always easy to tell who is winning in Othello because a single move may flip up to 20 discs.
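To make the flipping mechanic concrete, here is a minimal sketch (not taken from the project code) of counting the discs a move would flip along one of the eight directions; the board encoding of +1 for the mover's discs, -1 for the opponent's, and 0 for empty is an assumption.

// Hypothetical helper, not from the project source: counts the opponent
// discs a move at (row, col) would flip along one direction (dRow, dCol).
// Board encoding assumed: +1 = mover's disc, -1 = opponent's disc, 0 = empty.
static int flipsInDirection(int[][] board, int row, int col, int dRow, int dCol) {
    int count = 0;
    int r = row + dRow, c = col + dCol;
    // Walk over a contiguous run of the opponent's discs.
    while (r >= 0 && r < 8 && c >= 0 && c < 8 && board[r][c] == -1) {
        count++;
        r += dRow;
        c += dCol;
    }
    // The run is flipped only if it is bracketed by one of the mover's discs.
    boolean bracketed = r >= 0 && r < 8 && c >= 0 && c < 8 && board[r][c] == 1;
    return bracketed ? count : 0;
}

Summing this count over all eight directions gives the total number of discs a single move flips, which is why the material balance can swing sharply from one move to the next.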

 

For the project, I investigated improvements to TD-lambda when applied to Othello.  TD-lambda is a reinforcement learning technique.  In reinforcement learning, instead of giving the network inputs with target outputs, the network takes actions and then receives a reward (positive or negative).  The difficulty is that the network may make a long series of moves before receiving any reward, and it must somehow distribute credit for that reward across the moves.  This is the temporal credit assignment problem, and TD-lambda is one method for solving it.  Rather than try to summarize the TD-lambda algorithm here, I refer the reader to Sutton's original paper and the additional references below (a rough sketch of the core update follows the links):

 

Sutton's Original Paper

An addendum that explains the algorithm in more detail

Pseudo code (with some errors)

http://en.wikipedia.org/wiki/Temporal_difference_learning
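As a rough illustration only (this is not the project's actual network code), here is a minimal sketch of the TD(lambda) update using a linear value function with eligibility traces; the learning rate, trace-decay parameter, and feature representation are all assumptions made for the example.

// Minimal TD(lambda) sketch with a linear value function over board features.
// All constants and the feature extraction are hypothetical placeholders.
public class TdLambdaSketch {
    static final double ALPHA = 0.01;   // learning rate (assumed)
    static final double GAMMA = 1.0;    // no discounting within a single game
    static final double LAMBDA = 0.7;   // trace-decay parameter (assumed)

    double[] weights;   // value-function parameters
    double[] traces;    // eligibility traces, one per weight

    TdLambdaSketch(int numFeatures) {
        weights = new double[numFeatures];
        traces = new double[numFeatures];
    }

    double value(double[] features) {
        double v = 0.0;
        for (int i = 0; i < weights.length; i++) v += weights[i] * features[i];
        return v;
    }

    // One TD(lambda) step after moving from a state with 'features' to a state
    // with 'nextFeatures', receiving 'reward' (zero until the game ends).
    // Each weight is nudged toward the TD error, scaled by its eligibility trace.
    void update(double[] features, double[] nextFeatures, double reward, boolean terminal) {
        double vNext = terminal ? 0.0 : value(nextFeatures);
        double delta = reward + GAMMA * vNext - value(features);
        for (int i = 0; i < weights.length; i++) {
            traces[i] = GAMMA * LAMBDA * traces[i] + features[i];  // accumulate trace
            weights[i] += ALPHA * delta * traces[i];
        }
    }
}

The eligibility traces are what let a reward at the end of the game reach back and adjust the evaluations of the moves that led to it, with older moves receiving exponentially less credit.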

 

In his paper entitled Reinforcement Learning in Board Games, Imran Ghory surveys the field of reinforcement learning applied to board games and suggests several avenues for improvement.  Earlier in the semester, I gave a presentation summarizing his work.

 

In particular, I implemented TD-lambda training with database play, self-play, asymmetric self-play, and board inversion.  My results are summarized in my final presentation.
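To give a flavor of one of these ideas, board inversion, here is a hedged sketch that assumes a +1/-1/0 board encoding rather than the project's actual representation: negating every square swaps the two colors, so a single value function can always evaluate the position from the perspective of the side to move.

// Hypothetical sketch of board inversion. Squares are assumed to be encoded
// as +1 (black disc), -1 (white disc), 0 (empty). Negating every square turns
// a "white to move" position into an equivalent "black to move" position, so
// one evaluator can score every position from the mover's point of view.
static double[] invertBoard(double[] board) {
    double[] inverted = new double[board.length];
    for (int i = 0; i < board.length; i++) {
        inverted[i] = -board[i];
    }
    return inverted;
}

During self-play, positions with white to move would be inverted before being fed to the evaluator, and the same trick applies when computing the training updates.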

 

Source code: src.zip

To use it, create a new Eclipse project and use the extracted folder as the project folder.

 

The framework for the Othello code and training is based on http://www.cs.vu.nl/~mkeijzer/mlproject-2007/mlproject.html, and the benchmarking opponent engine is from http://www.luthman.nu/Othello/Othello.html.

 

References:

http://www.cs.bris.ac.uk/Publications/Papers/2000100.pdf

http://www.cs.ualberta.ca/~sutton/papers/sutton-89.pdf

http://www.cs.ualberta.ca/~sutton/papers/sutton-88.pdf

http://www.cs.ualberta.ca/~sutton/td-backprop-pseudo-code.text

http://www.luthman.nu/Othello/Othello.html

http://www.cs.ualberta.ca/~mburo/log.html

http://www.cs.vu.nl/~mkeijzer/mlproject-2007/mlproject.html

http://en.wikipedia.org/wiki/Q-learning

http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol2/zah/article2.html

http://en.wikipedia.org/wiki/Temporal_difference_learning