Reinforcement Learning with Othello
CS 152 Neural Networks
George Tucker
For our term project, we were allowed to choose a topic of interest to pursue. When learning about temporal difference learning, particularly TD-Gammon, I was amazed that the network learned unsupervised, by playing against itself. I decided to pursue temporal difference learning applied to Othello. Othello is played on a 64-square board with black and white discs. The object of the game is to control a majority of the pieces at the end of the game by turning over as many of your opponent's pieces as possible. It is not always easy to tell who is winning in Othello because a single move may change up to 20 pieces.
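As a concrete illustration of the disc-counting objective (my own sketch, not code from the project), a board can be stored as a flat 64-entry array with +1 for a black disc, -1 for a white disc, and 0 for an empty square; the sign of the sum then says who currently holds more pieces:

    // Hypothetical board encoding: +1 = black disc, -1 = white disc, 0 = empty.
    public class DiscCount {
        public static final int EMPTY = 0, BLACK = 1, WHITE = -1;

        // Positive: black holds more pieces; negative: white does; zero: tied.
        public static int discDifference(int[] board) {
            int diff = 0;
            for (int square : board) {
                diff += square;
            }
            return diff;
        }

        public static void main(String[] args) {
            int[] board = new int[64];            // empty 8x8 board, row-major
            board[27] = WHITE; board[28] = BLACK;
            board[35] = BLACK; board[36] = WHITE; // standard starting position
            System.out.println(discDifference(board)); // prints 0: the game starts tied
        }
    }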
For the project, I investigated improvements to TD-lambda when applied to Othello. TD-lambda is a reinforcement learning technique. In reinforcement learning, instead of giving the network inputs with target outputs, the network takes an action and then receives a reward (positive or negative). The difficulty is that the network may make a series of moves before receiving a reward, and it must then somehow distribute credit for that reward among the moves. This is the temporal credit assignment problem, and TD-lambda is one method for solving it. Rather than try to summarize the TD-lambda algorithm, I refer the reader to Sutton's original paper and the additional references below (a small sketch of the update also follows the references):
An addendum that explains the algorithm in more detail
Pseudo code (with some errors)
http://en.wikipedia.org/wiki/Temporal_difference_learning
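To give a feel for what the algorithm does without repeating those references, here is a minimal sketch of the TD(lambda) update for a linear value function V(s) = w . x(s). It is my own illustration under that linear assumption, not the project's network code; the eligibility trace is the mechanism that spreads credit for a reward back over earlier moves.

    // Minimal TD(lambda) sketch for a linear value function V(s) = w . features(s).
    public class TDLambdaSketch {
        private final double[] w;       // value-function weights
        private final double[] trace;   // eligibility trace, one entry per weight
        private final double alpha;     // learning rate
        private final double gamma;     // discount factor
        private final double lambda;    // trace-decay parameter

        public TDLambdaSketch(int n, double alpha, double gamma, double lambda) {
            this.w = new double[n];
            this.trace = new double[n];
            this.alpha = alpha;
            this.gamma = gamma;
            this.lambda = lambda;
        }

        public double value(double[] features) {
            double v = 0.0;
            for (int i = 0; i < w.length; i++) v += w[i] * features[i];
            return v;
        }

        // One TD(lambda) step after observing the transition s -> s' with reward r.
        public void update(double[] features, double reward, double[] nextFeatures, boolean terminal) {
            double vNext = terminal ? 0.0 : value(nextFeatures);
            double delta = reward + gamma * vNext - value(features);  // TD error
            for (int i = 0; i < w.length; i++) {
                // For a linear V, the gradient of V(s) with respect to w[i] is features[i].
                trace[i] = gamma * lambda * trace[i] + features[i];
                w[i] += alpha * delta * trace[i];
            }
        }
    }

With a neural network (as in TD-Gammon and this project), features[i] in the trace update is replaced by the gradient of the network output with respect to each weight.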
In his paper entitled Reinforcement Learning in Board Games, Imran Ghory surveys the field of reinforcement learning applied to board games and suggests several avenues for improvement. Earlier in the semester, I gave a presentation summarizing his work.
In particular, I implemented TD-lambda training with database play, self-play, asymmetric self-play, and board inversion. My results are summarized in my final presentation.
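Board inversion exploits the colour symmetry of Othello: every position seen during self-play yields a second training example by swapping black and white discs and negating the target value. A minimal sketch, assuming the same hypothetical +1/-1/0 board encoding as above and a value function scored from black's point of view (not the project's actual classes):

    public class BoardInversion {
        // Swap black and white discs; empty squares are unchanged
        // (board encoding: +1 = black, -1 = white, 0 = empty).
        public static int[] invert(int[] board) {
            int[] inverted = new int[board.length];
            for (int i = 0; i < board.length; i++) {
                inverted[i] = -board[i];
            }
            return inverted;
        }

        // When the value function scores positions from black's point of view,
        // the inverted position is worth the negation of the original target.
        public static double invertedTarget(double target) {
            return -target;
        }
    }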
Source Code: src.zip
To use, create a new Eclipse project and use the extracted folder as the project folder.
The base framework for the Othello code and training is based on http://www.cs.vu.nl/~mkeijzer/mlproject-2007/mlproject.html, and the benchmarking opponent engine is from http://www.luthman.nu/Othello/Othello.html.
References:
http://www.cs.bris.ac.uk/Publications/Papers/2000100.pdf
http://www.cs.ualberta.ca/~sutton/papers/sutton-89.pdf
http://www.cs.ualberta.ca/~sutton/papers/sutton-88.pdf
http://www.cs.ualberta.ca/~sutton/td-backprop-pseudo-code.text
http://www.luthman.nu/Othello/Othello.html
http://www.cs.ualberta.ca/~mburo/log.html
http://www.cs.vu.nl/~mkeijzer/mlproject-2007/mlproject.html
http://en.wikipedia.org/wiki/Q-learning
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol2/zah/article2.html
http://en.wikipedia.org/wiki/Temporal_difference_learning