Parallel Implementation of a Deep Belief Network

Eric Mullen, Neural Networks, Fall 2010
Links: My Code | Presentation Slides | RBM-Provisor Code | JOCL Library

Summary of Problem

Deep belief networks have been shown to do amazing things. However, no algorithm to date facilitates a particularly fast serial implementation of a deep belief network. These networks are very complex, so training takes a long time. Further, acquiring the data for even a small weight change requires a large amount of activity in the network, and that amount of activity scales linearly with the depth of the weights being changed.

Specifically, RBM-Provisor uses a deep belief network to generate jazz music. It is a project of Professor Keller at Harvey Mudd College, and it has shown promise in the field of jazz music generation. The deep belief network used by this system trains slowly, which is a pain, and it also generates music slowly, making it infeasible for real-time use. A significant speedup in training could facilitate more systematic experimentation with training data sets and training tweaks, while a speedup in music generation might make the system feasible for real-time generation of jazz.

Possible Solutions

The current state of affairs is that the deep belief network used for RBM-Provisor is written in Java, in serial. One reason the program is slower than it could be is a basic constraint of Java: run-time array bounds checking. Every time an array element is accessed or assigned to, Java checks that the access is not out of bounds, and since the deep belief network is represented as arrays, this slows down the whole process. One solution would be to write the network in C and call the native code from Java. That would eschew the bounds-checking overhead, placing the burden on the programmer to verify, at the time of writing, that arrays are never accessed out of bounds. It would yield a speed increase, but probably not one on the scale of the next proposed solution.

Deep belief networks are an embarrassingly parallel SIMD system. An ideal parallelization of the current code could be on the order of hundreds or even thousands of times faster. Since almost everything can be done in parallel instead of in serial, the potential for speedup is huge. This is the approach I took to speed up the operation of this deep belief network.

Deep Belief Networks

A deep belief network is a series of layers of neurons, each of which is either in an on state or an off state (0 or 1). Adjacent layers are symmetrically connected, so the network is essentially linear: the first layer is connected to the second, the second to the first and third, and so on. Each neuron in a layer has a connection to every neuron in each adjacent layer, and each connection has a weight associated with it.

To activate a layer means to update the states of all the neurons in that layer, using the connections from either the layer above or the layer below. For each neuron in the layer being activated, a weighted sum is computed: each neuron state in the adjacent layer is multiplied by the weight of the corresponding connection, and the products are summed. If this sum is above a randomly generated threshold, the neuron is turned on; otherwise it is turned off.
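As a serial sketch in plain Java (names are hypothetical, and the thresholding rule follows the description above; a standard RBM would typically pass the sum through a logistic function before comparing against a uniform random number):

```java
import java.util.Random;

public class Activation {
    // Activate the units of one layer from an adjacent layer's states.
    // weights[i][j] connects adjacent-layer unit i to unit j of this layer.
    static int[] activate(int[] adjacent, double[][] weights, Random rng) {
        int n = weights[0].length;
        int[] states = new int[n];
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int i = 0; i < adjacent.length; i++) {
                sum += adjacent[i] * weights[i][j];
            }
            // The unit turns on if its input beats a random threshold in [0, 1).
            states[j] = sum > rng.nextDouble() ? 1 : 0;
        }
        return states;
    }

    public static void main(String[] args) {
        int[] adjacent = {1, 0, 1};
        double[][] weights = {{2.0, -1.0}, {5.0, 5.0}, {1.5, -1.0}};
        int[] states = activate(adjacent, weights, new Random(0));
        // Unit 0 sees input 3.5 (above any threshold in [0,1)); unit 1 sees -2.0.
        System.out.println(states[0] + " " + states[1]); // prints "1 0"
    }
}
```

Note that the inner loop for each unit j touches only the adjacent layer and column j of the weights, which is what makes the operation so parallelizable later on.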

Contrastive Divergence Training Method

The contrastive divergence training method is a method for training deep belief networks, and it is the method used to train this particular network. It consists of a few basic operations, all of which are embarrassingly parallel. A subset of the same operations is used to generate results (or classify samples) with this network architecture, so by focusing on speeding up these operations, the performance of the network as a whole improves.

These operations are activation (discussed above), accumulation of differences, and weight adjustment. Accumulation of differences builds a matrix the size of the weight matrix, in which each entry is the product of the states of the two neurons at either end of the corresponding connection. Adjusting the weights consists of multiplying one or more difference matrices by the learning rate and adding them to the weight matrix.
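A minimal serial sketch of these two operations in plain Java (names hypothetical):

```java
public class WeightUpdate {
    // Accumulate the outer product of the two layers' states into acc:
    // acc[i][j] += lower[i] * upper[j], one entry per connection.
    static void accumulate(double[][] acc, int[] lower, int[] upper) {
        for (int i = 0; i < lower.length; i++)
            for (int j = 0; j < upper.length; j++)
                acc[i][j] += lower[i] * upper[j];
    }

    // w += rate * (positive - negative), entry by entry.
    static void adjust(double[][] w, double[][] pos, double[][] neg, double rate) {
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < w[i].length; j++)
                w[i][j] += rate * (pos[i][j] - neg[i][j]);
    }

    public static void main(String[] args) {
        double[][] pos = new double[2][2], neg = new double[2][2];
        accumulate(pos, new int[]{1, 1}, new int[]{1, 0});
        accumulate(neg, new int[]{0, 1}, new int[]{1, 1});
        double[][] w = new double[2][2];
        adjust(w, pos, neg, 0.1);
        System.out.println(w[0][0] + " " + w[1][1]); // prints "0.1 -0.1"
    }
}
```

Every entry of the accumulated matrix, and every entry of the weight update, depends only on its own pair of neuron states, which is exactly the independence exploited later.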

To train any one particular layer, you activate the network from the input up to the neurons just below the weights you want to train. You then activate once more, across the weights being trained, and accumulate the differences. Next you activate back down once, then back up again, and accumulate the differences a second time. Finally, you add the first set of differences (multiplied by the learning rate) to the weights and subtract off the second set. You repeat this ad nauseam for each layer and each piece of training data.
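Putting the steps above together, one training step for a single weight matrix (contrastive divergence with one down-up pass) might look like this sketch. All names are hypothetical, and a fixed 0.5 activation threshold stands in for the random thresholds so the example is deterministic:

```java
public class CD1Step {
    // Activate across w: up maps the lower layer to the upper one,
    // down maps the upper layer back to the lower one (weights are shared).
    static int[] activate(int[] src, double[][] w, boolean up) {
        int n = up ? w[0].length : w.length;
        int[] out = new int[n];
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int i = 0; i < src.length; i++)
                sum += src[i] * (up ? w[i][j] : w[j][i]);
            out[j] = sum > 0.5 ? 1 : 0; // fixed threshold, for determinism
        }
        return out;
    }

    // acc[i][j] += lower[i] * upper[j], one entry per connection.
    static void accumulate(double[][] acc, int[] lower, int[] upper) {
        for (int i = 0; i < lower.length; i++)
            for (int j = 0; j < upper.length; j++)
                acc[i][j] += lower[i] * upper[j];
    }

    // One step: up, accumulate, down, up, accumulate, then adjust the weights.
    static void cd1(double[][] w, int[] visible, double rate) {
        int[] hidden = activate(visible, w, true);
        double[][] pos = new double[w.length][w[0].length];
        accumulate(pos, visible, hidden);            // first set of differences
        int[] recon = activate(hidden, w, false);    // back down once
        int[] hidden2 = activate(recon, w, true);    // and up again
        double[][] neg = new double[w.length][w[0].length];
        accumulate(neg, recon, hidden2);             // second set of differences
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < w[i].length; j++)
                w[i][j] += rate * (pos[i][j] - neg[i][j]);
    }

    public static void main(String[] args) {
        double[][] w = {{0.6, 0.6}, {0.6, 0.6}};
        cd1(w, new int[]{1, 0}, 0.1);
        // The reconstruction turned on the second visible unit, so its row
        // of weights is pushed down while the first row is left alone.
        System.out.println(w[1][0] + " " + w[1][1]);
    }
}
```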

Parallelizability

Formally, any two operations can be performed in parallel if neither depends on the result of the other. At the macro scale, each of the operations discussed above depends on the previous one, so they cannot be intermixed. At the micro scale, however, many things can be done in parallel.

First, activation. Each neuron being activated depends on the state of every neuron in the adjacent layer, but on none of the other neurons being activated, so every neuron in the layer can be activated simultaneously. With n neurons in the layer being activated, you could get a parallel speedup of up to a factor of n.
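This per-neuron independence maps directly onto multicore CPUs as well. A minimal sketch using Java's parallel streams (a newer language feature than this project's code; a fixed 0.5 threshold is used for reproducibility):

```java
import java.util.stream.IntStream;

public class ParallelActivation {
    static int[] activate(int[] adjacent, double[][] weights) {
        int[] states = new int[weights[0].length];
        // Each unit j reads only the adjacent layer and its own weight
        // column, and writes only its own slot, so all units can be
        // computed concurrently without any locking.
        IntStream.range(0, states.length).parallel().forEach(j -> {
            double sum = 0.0;
            for (int i = 0; i < adjacent.length; i++)
                sum += adjacent[i] * weights[i][j];
            states[j] = sum > 0.5 ? 1 : 0;
        });
        return states;
    }

    public static void main(String[] args) {
        int[] s = activate(new int[]{1, 1}, new double[][]{{0.4, 0.1}, {0.4, 0.1}});
        System.out.println(s[0] + " " + s[1]); // sums are 0.8 and 0.2: prints "1 0"
    }
}
```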

Accumulation of differences and weight adjustment are even more parallelizable: each individual connection, characterized by a unique neuron at each end, is independent of every other. So with m neurons in the layer being read from and n neurons in the layer being activated, you could get a parallel speedup of up to a factor of m times n.

Possibilities for Parallelization

Usually when you think about parallelizing an algorithm, you think about running it on a dual-core or quad-core computer. That, however, only allows for a factor of 2 or 4 speedup, which would be nice but hardly ideal. Since this is very strictly SIMD parallelism, we can instead turn to our graphics cards for sheer data-parallel compute power.

There are really two routes to go if you want to write code that runs on your graphics card and you're not doing any kind of computer graphics with it. One is specific to NVidia cards (CUDA), while the other works on both AMD/ATI and NVidia cards (OpenCL). AMD had its own system for a while, but has since thrown its weight behind OpenCL. I decided to use OpenCL rather than CUDA for two reasons: OpenCL works on all recent graphics cards, and OpenCL has numerous Java wrappers. For this project I used JOCL, a thin JNI wrapper around the OpenCL library. Installation instructions are linked at the top of the page.

Overview of OpenCL

To really get a good idea of what OpenCL is, this page is a better resource than anything I could write here. The basic idea, though, is that you create kernels: little bits of C code that will be executed a large number of times on your GPU. They are compiled at run time, so they usually appear as string constants in your code. OpenCL maintains a queue of work tasks, and you can enqueue kernels with specific arguments to be executed, in parallel, on your GPU. You also have GPU-side memory objects, which you can enqueue reads from and writes to. There are a bunch of other gory details, but this is the basic idea.
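As an illustration, an activation kernel in this style might look like the following OpenCL C sketch (hypothetical, not the actual CLLayeredRbm kernel, and not runnable on its own; it needs host code that compiles it and sets up the buffers). The random thresholds are generated on the CPU and passed in, since kernels have no random number generator:

```c
/* One work-item per neuron in the layer being activated. In the Java
 * host code this source lives in a string constant and is compiled at
 * run time with clBuildProgram. */
__kernel void activate(__global const float *adjacent,   /* m states */
                       __global const float *weights,    /* m x n, row-major */
                       __global const float *thresholds, /* pre-generated on the CPU */
                       __global float *states,           /* n outputs */
                       const int m)
{
    int j = get_global_id(0);     /* which neuron this work-item updates */
    int n = get_global_size(0);
    float sum = 0.0f;
    for (int i = 0; i < m; i++)
        sum += adjacent[i] * weights[i * n + j];
    states[j] = sum > thresholds[j] ? 1.0f : 0.0f;
}
```

Enqueueing this kernel with a global work size of n activates every neuron in the layer in one shot, which is the whole point of the exercise.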

Notes about Implementation

I wrote a class called CLLayeredRbm, available via the link above. It has an OpenCL kernel for each basic operation of a deep belief network: activation (up and down), accumulation (positive and negative), and weight adjustment. My hope was that parallelizing these processes would greatly reduce the time spent training the network. I don't claim to know everything about making OpenCL applications fast, so I could be missing something simple that would greatly speed up the existing network.

Benchmarking Results

To benchmark this I used my clinic workstation, since it has quite the graphics card. The processor is an Intel Core 2 Duo E6750 at 2.66 GHz, and the graphics card is an NVidia GeForce 8800 GTX. The card has an NVidia G80 GPU with 128 stream processors and a 575 MHz clock.

To benchmark, I selected four children's song licks to train on from the Data folder included in RBM-Provisor. I picked these because they were short, which kept training under 10 minutes. The original code took 5.39 minutes to train on these four licks, while the parallel OpenCL code took 4.88 minutes, a speedup of about 10%. Further, the OpenCL code should scale nicely to newer, much more powerful graphics cards, with additional performance gains there.

Future Work

There is quite a bit of work to still be done on this project, even though it is already showing some promising results:

Random number generation is done entirely on the CPU, and the numbers are then copied to the GPU. This is because random number generation is an inherently serial task: each random number generated depends on, and updates, the state of the random number generator. Currently there is no OpenCL support for any kind of random number generation, but there is definitely talk of including a more parallel approach to random numbers.

I have not formally experimented with different ways of performing synchronization in OpenCL. Right now I use barriers on the global work queue, which nothing is allowed to cross. There are ways of putting more efficient synchronization inside the kernels themselves, which I didn't play with at all; these might lead to faster code.

Right now, every time a piece of training data is presented in the training loop, it is copied over to the GPU. Given the amount of data reuse, it would probably be much more efficient to copy all the training data to VRAM once, at the start of the loop, and simply read from there.

The code is fairly solid as it is, but it could definitely benefit from a formal code review. Since the functionality is supposed to exactly replicate another existing piece of code, a formal review comparing the two would be useful.

Finally, for better speed increases, the training should be benchmarked and profiled on a much newer graphics card with more processors, to really get an idea of where the serial bottlenecks are. On the computer I benchmarked with, the parallelized version ran only a little faster than the serial version, which is somewhat troubling. Further investigation in this direction would most likely prove fruitful.