CS 152: Neural Networks Final Project

Using Adalines to Approximate Q-functions in Reinforcement Learning

Reinforcement learning in its purest form is a technique for learning an optimal policy when using a Markov model of the world. In this model there is a finite set of states S = {all states} and actions A = {all actions for all states}. A policy maps states to actions ( policy : S -> A ). Given a state-action pair, a transition function delta determines the resulting state ( delta : (S,A) -> S ). Putting these two together, we get a chain of state-action-state tuples in which the policy determines each action and delta determines the resulting state. We can then apply a reward function r that, given a state-action pair, yields a reward ( r : (S,A) -> reward ).

In many cases this reward is delayed, arriving only after a chain of actions finally reaches it. To take this into account, the effective reward of an action is its immediate reward plus the discounted effective reward of the best future action (the discount factor gamma must be between 0 and 1): Reff(s,a) = r(s,a) + gamma * max( Reff(s',a') ) over all a', where s' = delta(s,a). We then define the optimal policy as the one that maximizes the effective reward over the lifetime of the algorithm. The overall value of taking a given action from a given state is given by the function Q, or in many cases, an approximation of the Q function. The approximation of Q can be iteratively refined by the assignment Q(s,a) <- r(s,a) + gamma * max( Q(s',a') ) over all a'.
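
A minimal sketch of that iterative refinement, assuming a small finite model stored in arrays (all names and values below are illustrative, not taken from GLD or any library):

    // Illustrative tabular version of Q(s,a) <- r(s,a) + gamma * max_a' Q(s',a').
    // q[s][a] is the current estimate, delta[s][a] the resulting state, and
    // reward[s][a] the immediate reward; GAMMA is the discount factor.
    class QIteration {
        static final double GAMMA = 0.9; // must be between 0 and 1

        static void update(double[][] q, int s, int a,
                           int[][] delta, double[][] reward) {
            int sPrime = delta[s][a];
            double best = q[sPrime][0];
            for (double v : q[sPrime]) {
                best = Math.max(best, v); // max over all actions a' from s'
            }
            q[s][a] = reward[s][a] + GAMMA * best;
        }
    }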

The Green Light District Traffic Simulator

In the paper "Intelligent Traffic Light Control" ( http://www.cs.uu.nl/people/marco/GROUP/ARTICLES/intelligent_tlc.pdf ), Wiering et al. describe their traffic simulator Green Light District. The paper goes into greater depth, but I will give a brief summary of the system as it relates to my work here. An Infrastructure is made up of a collection of nodes connected by sets of lanes. The interesting nodes are edge nodes and traffic lights. Edge nodes spawn road users, each of which has a desired destination edge node. During each cycle of the simulation, the SimModel asks the traffic light controller (TLController) for the relative difference in predicted effective reward (called gain in the program) between each lane being green and being red. The SimModel then uses this information to pick the legal light configuration that will yield the most gain. Once the light configuration is determined, road users are allowed to move (provided their path is not blocked by a red light or another stopped road user). The TLController is given the opportunity to listen in on these movements to calculate rewards or anything else it may find useful.
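
In outline, one simulation cycle looks roughly like the following sketch. The interfaces here are paraphrases of the GLD classes, not the real API; only the order of operations is taken from the description above.

    // Rough outline of one GLD cycle: ask the controller for per-lane gains,
    // set the best legal light configuration, move road users, and let the
    // controller observe the movements. All names are hypothetical.
    interface Model {
        int[] bestLegalConfig(double[] gains); // legal lights maximizing total gain
        void setLights(int[] config);
        void moveRoadUsers();                  // users blocked by red lights stay put
    }

    interface Controller {
        double[] predictGains();               // green-vs-red gain for each lane
        void observeMoves(Model model);        // compute rewards from the movements
    }

    class SimCycle {
        static void step(Model model, Controller tlc) {
            model.setLights(model.bestLegalConfig(tlc.predictGains()));
            model.moveRoadUsers();
            tlc.observeMoves(model);
        }
    }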

Inserting an Adaline

Along with the simulator, Wiering and company wrote a few different TLControllers to compare their performance under different conditions. Most of these algorithms used dynamic programming in one form or another to approximate the Q function. While proven to be an effective method for approximating Q, dynamic programming is limited to finite state spaces and discrete inputs.

For my project, I implemented a TLController that uses an adaline to approximate the Q function. The primary advantage of using an adaline or other neural network is that it is not constrained to finite or discrete state spaces. The potential problem I foresaw was that it is hard to tell what the adaline will learn, and it may in fact fail to approximate Q effectively. This turned out to be a secondary issue: the adaline was able to learn to maximize the reward, provided there was a correlation between input and reward.

My controller uses two adalines for each lane in the infrastructure: one to approximate the value of the lane's light being red, the other the value of its being green. The output gain is the difference between these two outputs. In a way this is a 2 x 1 two-layer network in which the second layer has fixed weights [1, -1]. As input, every adaline is given the following information for every lane in the infrastructure: the number of cars waiting, the traffic density, and whether or not the lane is full. The number of inputs is therefore three times the total number of lanes in the infrastructure. Most of this information is irrelevant to any particular lane, and in practice the weights for the inputs relating to most of the infrastructure converged to 0.
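
A sketch of this arrangement for a single lane (the names are mine; the real implementation lives in AdalineNeuralTLC.java):

    // Two adalines per lane: one predicts the value of the light being red,
    // the other of it being green. Both see the same global input vector of
    // (cars waiting, density, lane full?) for every lane in the infrastructure.
    class LaneAdalines {
        final double[] redWeights;
        final double[] greenWeights;

        LaneAdalines(int numInputs) {
            redWeights = new double[numInputs];
            greenWeights = new double[numInputs];
        }

        static double output(double[] w, double[] x) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) sum += w[i] * x[i];
            return sum; // linear (adaline) output
        }

        // gain = value(green) - value(red), i.e. a fixed second layer [1, -1]
        double gain(double[] x) {
            return output(greenWeights, x) - output(redWeights, x);
        }
    }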

The rewards are calculated based on the movement of road users. When a road user passes through an intersection, that lane is given a large reward. When a road user is able to move along a lane, the lane is given a small reward. Every time a lane is rewarded, the other lanes at that intersection are given a discounted share of the reward. This is intended to encourage a lane to be red when it has no cars waiting and other lanes do. Since each lane's adalines see all of the inputs, they are able to learn when it is advantageous to be red under this scheme.
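
The bookkeeping might look something like this (the constants here are placeholders, not the values my controller actually uses):

    // Distribute rewards across the lanes of one intersection. A lane that
    // lets a road user through gets the full reward; the other lanes at the
    // same node get a discounted share so that being red can also pay off.
    class IntersectionRewards {
        static final double CROSS_REWARD = 1.0;  // user passes the intersection
        static final double MOVE_REWARD  = 0.1;  // user advances within a lane
        static final double SHARE        = 0.25; // discount for the sibling lanes

        static void reward(double[] laneReward, int lane, double r) {
            for (int i = 0; i < laneReward.length; i++) {
                laneReward[i] += (i == lane) ? r : SHARE * r;
            }
        }
    }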

Every cycle, each adaline makes a note of its input and output for that cycle, and each lane receives some amount of reward. Because we are training the adaline to predict the reward for a state-action pair, we train on the reward received together with the previous cycle's state. As mentioned, in my implementation there is a separate adaline to predict the reward of being red and of being green; since a lane can only be red or green on a given cycle, the reward information applies to only one of the two. We therefore train the adaline corresponding to the color of the light for that cycle. Over time, the adalines for both red and green converge to their proper values. Naturally, for this system to work the TLController needs to experiment with apparently sub-optimal light configurations so that it can develop an accurate representation of the Q function. This is accomplished by giving the output gain values a percentage chance of being set randomly on any given cycle.
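
One training cycle might be sketched as follows, using the standard Widrow-Hoff (LMS) rule; the learning rate, exploration rate, and method names here are placeholders rather than the values in my code:

    // Train only the adaline matching the light's actual color this cycle,
    // nudging its prediction toward the reward that was actually received.
    class AdalineTraining {
        static final double ETA = 0.01;          // learning rate (placeholder)
        static final double EXPLORE_RATE = 0.05; // chance of a random gain

        static void train(double[] w, double[] lastInput, double reward) {
            double predicted = 0.0;
            for (int i = 0; i < w.length; i++) predicted += w[i] * lastInput[i];
            double error = reward - predicted;
            for (int i = 0; i < w.length; i++) w[i] += ETA * error * lastInput[i];
        }

        // Occasionally report a random gain so that apparently sub-optimal
        // light configurations still get explored.
        static double maybeExplore(double gain, java.util.Random rng) {
            return rng.nextDouble() < EXPLORE_RATE
                    ? rng.nextDouble() * 2 - 1
                    : gain;
        }
    }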

Conclusions

Before beginning implementation, I expected the primary issue to be whether or not an adaline would be able to learn the Q function. In practice, the largest issues I faced were in choosing an effective input and reward scheme. I discovered that the number of cars waiting and whether a lane is full are quite useful for predicting the reward scheme I used. Adding information about how long cars had been waiting had no effect on the overall performance of the TLController, because there was no corresponding waiting-time element in the rewards. The bottom line is that the adaline can learn to predict the rewards given relevant inputs, but the overall performance of the system depends on the reward scheme.

When running the AdalineTLC on various infrastructures, its performance is roughly equivalent to that of the MostFirst algorithm (MostFirst simply picks the light configuration that would allow the most cars through the intersection this cycle). In all cases, the TC-1 algorithm provided by Wiering and company outperformed my adaline algorithm. For the following tests, I primed the algorithm by first running it for 500 cycles so that it could learn the Q-function, and then ran it for 1000 cycles. The graphs below show the average trip waiting time, that is, the average time it took a road user to get from start to destination, minus the time it would have taken driving at top speed the entire way. There are two runs of the adaline algorithm to demonstrate the variability in how the adaline learns the Q-function. This metric is somewhat misleading, as the adaline has access to waiting time neither as an input nor as an influence on reward.

Further Work

The primary advantage of the adaline over a dynamic programming approach is that it can deal with continuous state spaces. In the GLD simulator, however, everything is discretized, so this advantage is largely lost. The next step in testing the effectiveness of an adaline at predicting the Q-function is to operate in a truly continuous environment.

As mentioned before, the reward system has an enormous impact on the performance of a reinforcement learning algorithm. It seems likely that changes in the reward system used could significantly increase the performance of the AdalineTLC. Naturally, the input data will also need to be modified to reflect changes in the rewards.

Finally, it may be argued that a more complicated network, such as a multi-layer backpropagation network, would be better able to predict the Q function. The actual gains from using a more complicated network are debated, however, and some would say there is no point, especially when we have the ability to expand the input space intelligently (http://www.cs.ualberta.ca/~sutton/RL-FAQ.html).

"Split" Infrastructure Performance
Split Infrastructure Split Performance Graph
"Complex" Infrastructure Performance
Complex Infrastructure Complex Performance Graph

Deliverables

Deliverable              Link
Proposal                 CS152finalproposal.doc
Final Presentation (PPT) swyckofffinalpres.ppt
Paper                    http://www.cs.uu.nl/people/marco/GROUP/ARTICLES/intelligent_tlc.pdf
GLD on SourceForge       http://sourceforge.net/projects/stoplicht
Source code              Modified GLD
Source code              Just AdalineNeuralTLC.java
Note: If you are trying to add my AdalineTLC to your version of GLD, be aware that I have not implemented the XMLSerializable interface properly, so if you wish to save a simulation, record the weights yourself and restore them when loading. Other than that, add it to the menu and factory wherever you see fit; I did not modify the rest of GLD in any significant way.

Steven Wyckoff
Page updated December 6, 2006
Contact me