CS 152: NEURAL NETWORKS
Evolving a Sigma-Pi Network as a Network Simulator
Justin Basilico


RESULTS

Experiment 1

For the simplest network architecture, consisting of just two input units and one output unit, a solution was evolved very quickly. In fact, it would normally take only between 10 and 25 generations to evolve a sigma-pi simulator network with a fitness of 0.0, meaning zero error on simulating all of the networks in the dataset used during evolution. On the separate set of networks to simulate that were not used in the evolution of the sigma-pi simulator, it also had an average mean squared error of 0.0 and was 100% correct. The chromosome for evolving this sigma-pi network contained only 22 bits, since there are only 22 possible connections between units in the sigma-pi network, which consisted of 5 input, 3 hidden, and 1 output units. With such a short binary chromosome, it is not surprising that a simulator for this very simple architecture could be evolved. Although the network generally evolves quickly, in some instances it gets stuck at a fitness of around 2.5. Since it was so quick to evolve in most cases, this result is encouraging: perhaps it will not be too hard to evolve sigma-pi simulators for larger networks.
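
To make the task concrete, the following is a minimal Python sketch of a hand-wired simulator for this smallest case. The network being simulated computes w1*x1 + w2*x2; the simulator forms each product with a pi unit and adds them with a sigma unit. The connection masks shown are an illustrative solution rather than the actual evolved chromosome, and the fifth input and third hidden unit, which presumably carry a bias term, are omitted here.

    def pi_unit(inputs, mask):
        # Product of the inputs selected by the binary connection mask.
        result = 1.0
        for x, m in zip(inputs, mask):
            if m:
                result *= x
        return result

    def sigma_unit(inputs, mask):
        # Sum of the inputs selected by the binary connection mask.
        return sum(x for x, m in zip(inputs, mask) if m)

    def simulate(w1, w2, x1, x2):
        # Simulator input layer: the simulated network's weights and inputs.
        inputs = [w1, x1, w2, x2]
        h1 = pi_unit(inputs, [1, 1, 0, 0])    # forms w1 * x1
        h2 = pi_unit(inputs, [0, 0, 1, 1])    # forms w2 * x2
        return sigma_unit([h1, h2], [1, 1])   # w1*x1 + w2*x2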

During the course of this project, this network was always very quick to evolve as long as a reasonable fitness function was used. Adding features to the genetic algorithm, such as crossing over groups of bits together because they form a functional unit, did not seem to affect performance either positively or negatively. The completely accurate simulation network was also able to evolve without using elite selection in addition to rank selection. The crossover rate and mutation rate do affect how quickly the optimal result is evolved, but as long as the values were reasonable, a solution would still eventually emerge. Early in the project, a sigma-pi simulator using sigmoid units was evolved for this architecture, taking only slightly more generations. However, sigmoid units were dropped, as mentioned above, because on the other networks they tended to get stuck at a local minimum of always guessing 0.5.
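
For reference, one generation of the genetic algorithm looks roughly like the Python sketch below. Rank selection is the method described in the text, but the specific rank-weighting scheme, the one-point crossover, and the rate values here are illustrative placeholders, not the project's actual settings.

    import random

    def rank_select(population, fitnesses):
        # Rank selection with fitness as an error measure (lower is better):
        # sort worst-first so the worst individual gets rank 1, the best
        # gets rank N, then pick with probability proportional to rank.
        order = sorted(range(len(population)),
                       key=lambda i: fitnesses[i], reverse=True)
        ranks = [0] * len(population)
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        pick = random.uniform(0, sum(ranks))
        running = 0.0
        for i, r in enumerate(ranks):
            running += r
            if running >= pick:
                return population[i]
        return population[-1]

    def next_generation(population, fitnesses,
                        crossover_rate=0.7, mutation_rate=0.01):
        # One-point crossover and bit-flip mutation on binary chromosomes.
        new_population = []
        while len(new_population) < len(population):
            a = rank_select(population, fitnesses)
            b = rank_select(population, fitnesses)
            if random.random() < crossover_rate:
                point = random.randrange(1, len(a))
                a = a[:point] + b[point:]
            child = [bit ^ (random.random() < mutation_rate) for bit in a]
            new_population.append(child)
        return new_population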

Experiment 2

The second experiment consists of adding just one more output unit to the previous network. Although this is only a small change in the architecture of the networks being simulated, it vastly increases the size of the chromosome that must be evolved, from 22 to 68 bits, since the sigma-pi network must now have 8 input, 6 hidden, and 2 output units. Since the size of the chromosome tripled, it is not surprising that it took much longer to evolve a solution. In general, it would take between about 100 and 500 generations, because the algorithm could get stuck at local minima that took many generations to escape. Given enough generations, however, it would always escape and reach a chromosome with a best fitness of 0.0, meaning the evolved network had zero error on simulating the networks in the evolution set. As in the first experiment, this best network with fitness 0.0 also generalized, simulating all of the networks in the separate testing dataset with 100% accuracy. The local minimum this network tended to get stuck at is around 1.34, where it could remain for several hundred generations in some instances. Inspecting one network that evolved to this fitness showed that it had found a correct solution for one of the two output units, where it was 100% correct, but had not figured out how to simulate the other output unit. Since the networks being simulated have no hidden layer, there is no interaction between the calculations for the first and second output units, so the simulator is really learning two tasks at once. To do this, it figures out one of the tasks (one output unit) and then the other. This also means that the network is evolving one part that is essentially the same network as in the first experiment. Thus, if this network was able to put together two versions of the network from the first experiment, perhaps the larger, multi-layer network will be able to do the same.
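
The reported chromosome lengths are consistent with one bit for every connection between adjacent layers plus one bit per non-input unit (presumably a bias connection; that is an inference from the numbers rather than something stated explicitly). A quick check in Python, using the 180-bit figure from the third experiment below:

    def chromosome_length(layer_sizes):
        # One bit per connection between adjacent layers, plus one assumed
        # bias connection for every non-input unit.
        connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
        biases = sum(layer_sizes[1:])
        return connections + biases

    assert chromosome_length([5, 3, 1]) == 22          # Experiment 1
    assert chromosome_length([8, 6, 2]) == 68          # Experiment 2
    assert chromosome_length([11, 9, 5, 3, 1]) == 180  # Experiment 3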

Since this experiment is inherently more difficult than the first, it should not be surprising that the evolution of this network is much more sensitive to the fitness function and learning parameters used. If the fitness function is too coarse (such as evaluating correctness within a threshold of 0.05), the genetic algorithm gets stuck in local minima far more often and may never converge on a solution. At first, one trick that would start the network off in the right direction was to add the number of connections in the network (the number of bits set to 1 in the chromosome) to the fitness function. This pushed evolution in the right direction, but it also created a new problem: so many connections would be removed from the network that there was not much diversity in the population. In general, though, it did speed up evolution. The fitness functions that used a threshold value for correctness rather than a straight mean squared error worked particularly well when trying to evolve a network with sigmoid units rather than linear ones, since they overcame the problem of the local minimum at 0.5. This fitness function was dropped, however, because in the third experiment it did not produce enough of a difference among fitness values.
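
Sketches of the fitness variants described above, assuming a hypothetical simulate(network, weights, inputs) interface that returns the simulator's output; the penalty scale in the last variant is a placeholder:

    def mse_fitness(network, dataset, simulate):
        # Summed squared error over the evaluation set; 0.0 is a perfect
        # simulator. Each case pairs a simulated network's weights and
        # inputs with the output it should produce.
        total = 0.0
        for weights, inputs, target in dataset:
            total += (simulate(network, weights, inputs) - target) ** 2
        return total

    def threshold_fitness(network, dataset, simulate, threshold=0.05):
        # Coarser variant: count outputs that miss the target by more than
        # a tolerance (0.05 is the threshold mentioned in the text).
        return sum(1 for weights, inputs, target in dataset
                   if abs(simulate(network, weights, inputs) - target) > threshold)

    def penalized_fitness(chromosome, base_fitness, penalty=1.0):
        # Early variant that also charged for every enabled connection
        # (bits set to 1), nudging evolution toward sparser networks.
        return base_fitness + penalty * sum(chromosome)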

Experiment 3

Since the other two networks were not that difficult to evolve, it seemed that it would be relatively simple to evolve the larger network that simulates networks with two input, two hidden, and one output units. This turned out not to be the case. The sigma-pi network needed to simulate these networks, which themselves contain hidden units, required a total of five layers (three of them hidden). The input layer had 11 units; the hidden layers had 9 (pi), 5 (sigma), and 3 (pi) units; and the output layer had 1 (sigma) unit. The total length of the binary chromosome needed to encode the connectivity of this network is 180 bits, so evolving the full connectivity is a much harder problem than the previous ones. Since the networks being simulated have multiple layers, there is a strong dependency between what happens in the first few layers and the last few: if either is incorrect, the network cannot produce the correct output. Also, because the fitness function never looks at the activation values of the hidden units and there is only one output, there are no separate sub-problems for the network to solve, as there were in experiment two.
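
Generalizing the pi and sigma units from the first experiment, a forward pass through this five-layer simulator might look like the sketch below. How the 180 chromosome bits map onto per-unit masks, and the omission of bias connections, are assumptions of this sketch.

    def layer_forward(inputs, masks, unit_type):
        # One layer of pi (product) or sigma (sum) units; each unit's
        # binary mask selects which previous-layer activations it receives.
        outputs = []
        for mask in masks:
            selected = [x for x, m in zip(inputs, mask) if m]
            if unit_type == "pi":
                value = 1.0
                for x in selected:
                    value *= x
                outputs.append(value)
            else:
                outputs.append(sum(selected))
        return outputs

    def forward(inputs, layer_masks):
        # layer_masks holds the masks for the 9-, 5-, 3-, and 1-unit
        # layers, which alternate pi / sigma / pi / sigma.
        activations = inputs
        for masks, unit_type in zip(layer_masks,
                                    ["pi", "sigma", "pi", "sigma"]):
            activations = layer_forward(activations, masks, unit_type)
        return activations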

Given this complexity and the size of the chromosomes, it is not surprising that the full connectivity of a simulation network for this type of network could not be evolved. Whenever the genetic algorithm was applied to the problem, it would seem to do well at first but would then get stuck at some local minimum that it could not escape even when run for tens of thousands of generations. To try to get the sigma-pi simulator network to evolve, multiple variations on the genetic algorithm were tried. In particular, the concept of a functional unit in the chromosome was added, where the bits that make up the same functional unit are crossed over together; here, a functional unit consisted of all of the weights into a particular node in the network. While it seemed that adding this concept might improve performance, it did not have any noticeable effect on this experiment or the two previous ones, so it was eventually removed from the program. When this problem was first encountered, only rank selection was used, so elite selection was added as well. Adding elite selection created a kind of momentum, so that the fitness would start decreasing much more quickly at first. However, once the algorithm reached the same local minimum it would still get stuck and never improve.
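
Python sketches of these two variations follow. Whether functional units were swapped uniformly (as here) or with a unit-aligned crossover point is not specified in the text, and the number of elites is a placeholder.

    import random

    def functional_unit_crossover(parent_a, parent_b, unit_slices):
        # unit_slices partitions the chromosome into (start, end) groups,
        # one per node, covering all of the connection bits into that node.
        # Whole units are inherited from one parent or the other, so a
        # functional unit is never split by crossover.
        child = list(parent_a)
        for start, end in unit_slices:
            if random.random() < 0.5:
                child[start:end] = parent_b[start:end]
        return child

    def with_elitism(population, fitnesses, offspring, n_elite=1):
        # Carry the n_elite best individuals (lowest error) into the next
        # generation unchanged, alongside the newly created offspring.
        ranked = sorted(zip(fitnesses, population), key=lambda pair: pair[0])
        elites = [individual for _, individual in ranked[:n_elite]]
        return elites + offspring[:len(offspring) - n_elite]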

Since none of these changes improved the results much, the structure of the simulator network being evolved was changed slightly by fixing part of the network so that the chromosome being evolved is smaller. The activation value of the output unit depends on both the weight values of the network being simulated and its hidden unit activations, so all of these values must be available to the second-to-last layer in order to compute the output properly. However, since the weights are specified at the input layer, their values need to be passed down unchanged through two more layers. Doing this requires 3 nodes on each layer with very sparsely connected weights whose only job is to carry these values to where they are used. Thus, this portion of the network was fixed by hard-wiring the connections for those units, so that the weight values for the second layer of simulated weights are automatically propagated through the network. Fixing this part of the network cuts the length of the chromosome in half, from 180 to just 90 bits, which is much more reasonable.
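
A sketch of how the fixed pass-through connections shrink the search space. Which 90 of the 180 bit positions are hard-wired, and their values, are not given in the text, so the fixed_positions mapping below is hypothetical.

    def expand_chromosome(evolved_bits, fixed_positions, total_length=180):
        # Merge the 90 evolved bits with the 90 hard-wired pass-through
        # bits (fixed_positions maps bit index -> fixed value, with 90
        # entries) to recover the full 180-bit connectivity description.
        full = [None] * total_length
        for position, bit in fixed_positions.items():
            full[position] = bit
        free_bits = iter(evolved_bits)
        for position in range(total_length):
            if full[position] is None:
                full[position] = next(free_bits)
        return full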

For this simulation network, a solution with a fitness of 0.0 was evolved after about 3000 generations. The evolved network was 100% correct both on the dataset used to evolve it and on the other dataset that it had never seen. Although a solution evolved in some runs, in others the fitness would not reach 0.0 and would remain stuck at a local minimum for thousands of generations. These local minima usually contain parts of an optimal solution mixed with other random connections that are hard to interpret. Even when a complete solution is found, the algorithm will spend many hundreds of generations stuck in one local minimum before moving on to the next. It should not be too surprising that an optimal network could be evolved once this part of the connectivity was fixed, because the remaining parts of the network only have to combine the networks from the previous two experiments, which amounts to just three copies of the smallest network. However, because of the dependency of the last few layers on the first few, it is still a difficult task to evolve such a network.

Future work

One of the main reasons it is probably so hard to evolve networks in this way is the coarseness of the fitness functions used. It is very hard to come up with a fitness function for these networks that offers much variety in fitness values; in most cases there seem to be distinct levels in the function that correspond to local minima, without much in between. One reason for this might be the encoding, in which all weights are either 1.0 or 0.0. Perhaps allowing the weights themselves to evolve would smooth out the fitness landscape so that less time is spent stuck in local minima. Another logical next step would be to see whether the same sorts of networks can be produced using sigmoid activation units rather than linear ones, or even to let the activation functions evolve along with the network. Yet another avenue to explore, along with evolving the weight values, is a less fixed architecture. For these experiments, the sizes of the layers in the sigma-pi network were chosen based on outside knowledge of how many units the simulator would need. It would be interesting to give the network more flexibility, to see whether it could evolve without as many constraints and whether it would find different solutions to the problem.

Since a simulator network could be evolved, it would be interesting to explore other possibilities for having networks such as sigma-pi networks take other networks as input and operate on them. As mentioned early on, one such application might be a network that applies the delta rule to output units, then one for hidden units, and then possibly a larger network that performs backpropagation on other networks. A pi-sigma network would probably be well suited to this, given the nature of the delta rule. Another avenue of future work would be to try to evolve a network that implements a learning rule for other networks; such a network could be tested by running other networks through it multiple times and checking whether their performance improves. Although the networks simulated so far have been assumed to be simple summation networks, simulators could also be built for other types of networks, such as sigma-pi networks themselves. It would also be interesting to see whether any learning algorithms for sigma-pi networks could learn these same simulation tasks.

Another approach would be to see whether sigma-pi networks could be created to act as a variety of different types of networks based on their input, without explicitly specifying the network the sigma-pi network has to imitate. This could be seen as programming the sigma-pi network's behavior at a higher level rather than explicitly specifying the network itself. In any case, there is a lot of interesting exploration to be done around these ideas, which still seem largely unexplored. The fact that networks can be simulated by other networks provides a basis from which these explorations can start.

