Bayesian Neural Nets and Automatic Relevance Detection
Important Links
NETLAB Library - An excellent and easy-to-use neural network library with fully implemented versions of Bayesian neural networks.
Bayesian Methods for Adaptive Models - David MacKay's thesis, which describes most of the methods discussed here.
Other References:
Neal, Radford M., Bayesian Learning for Neural Networks, New York: Springer-Verlag, 1996 - Neal's thesis, which formalizes ARD further and introduces the Hybrid Monte Carlo algorithm.
Nabney, Ian T., NETLAB: Algorithms for Pattern Recognition, New York: Springer-Verlag, 2002 - Book documenting and expanding on the incredibly useful Matlab library NETLAB.
Lagazio, Monica and Tshilidzi Marwala, "Assessing different Bayesian neural network models for militarized interstate dispute," Social Science Computer Review, 2006, 24(1), pp. 119-131 - Original inspiration for this paper. Built a model of interstate conflict with Bayesian neural nets and used ARD to argue that liberal variables were the most influential in determining conflict.
Introduction
Neural networks are fantastic tools for solving problems in the face of vast complexity. Researchers can use neural nets to model a problem with relatively little overhead, and neural nets have proven invaluable in control systems, biomedical systems, graphics, games, and a litany of other problems. However, neural nets do suffer from some drawbacks compared with competing methods. Chief among these is a tendency to overfit complex data, as shown in Figure 1. Overfitting, especially on noisy data, substantially reduces the usefulness of a neural network in almost all applications. Methods that reduce the possibility of overfitting are thus highly desirable.

Additionally, although neural networks may be highly successful at modeling complex data, extracting meaning from them is often extremely difficult. For example, a neural network trained to find the relationship between environmental variables and instances of cancer may be an effective predictor of cancer, but using that network to understand which environmental variables are causing the cancer will be difficult at best. Automated systems for extracting meaning from a neural network are therefore needed. Bayesian neural networks can provide a solution to both problems: overfitting and meaning extraction.
In this paper I will review some of the general principles of Bayesian neural nets and then examine one particular application of them: Automatic Relevance Detection.
Bayesian Neural Nets
The first major work on Bayesian neural networks was David MacKay's PhD thesis, "Bayesian Methods for Adaptive Models." MacKay began his thesis by arguing "the need for Occam's razor" for neural networks. Neural networks are not unique in their problems with overfitting: for any data set there exists a cubic function with four parameters that will fit the data better than a linear function with two parameters. However, the principle of Occam's razor demands that simpler models be preferred over more complex ones, even if the simpler models don't match the data quite as well. What neural networks need, then, are methods for automatically simplifying the model being developed. MacKay argues that Bayesian methods can provide this simplification.
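To make the cubic-versus-linear point concrete, here is a small Python sketch (a toy illustration of my own, not from MacKay): on the same data, a cubic's training error can never exceed a line's, because the line is a special case of the cubic.

```python
import numpy as np

# Data that is "really" linear, plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = 2 * x + rng.normal(scale=0.2, size=10)

for degree in (1, 3):
    coeffs = np.polyfit(x, t, degree)              # least-squares polynomial fit
    err = np.sum((np.polyval(coeffs, x) - t) ** 2) # training error
    print(f"degree {degree}: training error {err:.4f}")
# The cubic always reports lower (or equal) training error, even though
# the linear model is the better explanation of this data.
```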
Bayesian Method
Bayesian methods are all rooted in Bayes' rule:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

Equation 1
where θ is a model under consideration and D is the data. Bayes' rule states that the plausibility of the model given the data, P(θ|D), is equal to the probability that the data was generated by the model, P(D|θ), times how probable we thought the model was prior to seeing the data, P(θ), divided by a normalizing constant, P(D). (MacKay, 9) Bayes' rule thus allows us to quantitatively select between competing models.
This selection process is visualized in Figure 2 below. H1 and H2 are competing models. P(D|H1) and P(D|H2) are normalized probability distributions representing each model's probability of predicting a given data set. H2 is the more complex model, allowing for a larger variety of predictions; H1 accommodates a much smaller range. However, this means that if the data set falls within region C1, H1 predicts it much more strongly than H2 and is therefore the preferred model.

Figure 2 - From MacKay
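As a toy numerical sketch of this selection process (my illustration, not MacKay's), suppose H1 spreads its probability uniformly over 10 possible data sets and the more complex H2 over 100, and the observed data set is one that both can produce (region C1). Applying Equation 1 with equal priors:

```python
# Toy Bayesian model comparison via Equation 1.
prior = {"H1": 0.5, "H2": 0.5}       # equal prior belief in each model
likelihood = {"H1": 1 / 10,          # P(D|H1): 10 possible datasets, uniform
              "H2": 1 / 100}         # P(D|H2): 100 possible datasets, uniform

# P(D) is the normalizing constant: sum over models of P(D|H) * P(H).
evidence = sum(likelihood[h] * prior[h] for h in prior)

posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # {'H1': ~0.909, 'H2': ~0.091}: Occam's razor prefers H1
```

Even though H2 can also explain the data, the simpler H1 makes the sharper prediction and wins the posterior comparison by a factor of ten.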
Automatic Relevance Detection
Use of Bayes' rule to select between models is the distinguishing characteristic of Bayesian neural nets. One particularly useful and interesting application of this principle, first developed by MacKay and Neal, is Automatic Relevance Detection, or ARD.
Normal neural networks typically use the squared error of the outputs y(x; w, A) relative to the target values t as the error function E_D. Thus

$$E_D = \frac{1}{2} \sum_n \left( y(x^n; w, A) - t^n \right)^2$$

Equation 2
where x is the inputs, w is the weights, A is the network model, and β (introduced below) represents the "presumed noise" in the outputs. (MacKay, 43)
Additionally, ARD introduces a "weight decay" term E_W,

$$E_W = \frac{1}{2} \sum_i w_i^2$$

Equation 3

to the error function, making the total error function E

$$E = \beta E_D + \alpha E_W$$

Equation 4
E_W has the effect of "regularizing" the weights. That is, E_W penalizes large weights, which tend to be a source of overfitting in a neural network. As a neural network is trained, it will thus shy away from large weights in favor of smaller ones.
The introduction of a weight decay term can reduce the tendency of a neural network to overfit the data, but excessive weight decay can make the network "flat," causing it to miss important features and under-fit the data. Careful hand-tuning of the error parameters α and β is thus usually necessary when using a weight decay term.
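A minimal sketch of the combined objective in Equation 4, using a linear model in place of a real network and arbitrary illustrative values for α and β:

```python
import numpy as np

def total_error(w, X, t, alpha, beta):
    """Regularized objective E = beta*E_D + alpha*E_W (Equations 2-4)."""
    E_D = 0.5 * np.sum((X @ w - t) ** 2)   # data misfit (Equation 2)
    E_W = 0.5 * np.sum(w ** 2)             # weight decay term (Equation 3)
    return beta * E_D + alpha * E_W        # combined error (Equation 4)

# Toy data from a linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=20)

# The same fit scores worse as alpha grows: large weights cost more.
w = np.array([2.1, 0.4, -0.9])
for alpha in (0.0, 0.1, 10.0):
    print(alpha, total_error(w, X, t, alpha=alpha, beta=1.0))
```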
ARD improves on weight decay by using Bayesian methods to train the error parameters α and β. ARD assumes that the additive noise in t comes from a zero-mean Gaussian distribution, with β as the inverse variance of that distribution. Since E_W is the quadratic in Equation 3, the weights are likewise treated as coming from a zero-mean Gaussian distribution, with α as the inverse variance of the weights.
An additional common formulation is to break α into multiple α_k, with each α_k representing the inverse variance of the weights associated with a node k, and then integrate over the α_k's. (Neal, 1996) This allows for greater precision in controlling weight variance and for culling unnecessary neurons.
A Bayesian neural network with ARD trains in two stages. In the first stage, the weights are trained to convergence or to an iteration limit using a standard training algorithm modified for weight decay, such as gradient descent or conjugate gradient. In the second stage, the most likely values of α and β are re-estimated for the trained weights using the evidence method described by MacKay (pp. 44-45). This process is repeated until the user is satisfied with the results.
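Here is a minimal sketch of this two-stage loop, using a linear model so the first stage has a closed form (a real network would instead run gradient-based training of E). The re-estimation step uses the standard evidence-framework updates α = γ/2E_W and β = (N-γ)/2E_D, where γ is the effective number of well-determined parameters; the stand-in model and toy data are my own.

```python
import numpy as np

def evidence_train(X, t, n_iters=50, alpha=1.0, beta=1.0):
    """Two-stage training loop under the evidence approximation.

    Stage 1 finds the most probable weights for the current alpha and
    beta; stage 2 re-estimates alpha and beta from those weights. A
    linear model y = X @ w stands in for the neural network.
    """
    N, d = X.shape
    for _ in range(n_iters):
        # Stage 1: minimize E = beta*E_D + alpha*E_W for fixed alpha, beta.
        A = beta * X.T @ X + alpha * np.eye(d)    # Hessian of E at the minimum
        w = beta * np.linalg.solve(A, X.T @ t)
        # Stage 2: evidence re-estimation of the hyperparameters.
        lam = beta * np.linalg.eigvalsh(X.T @ X)  # eigenvalues of beta * Hessian(E_D)
        gamma = np.sum(lam / (lam + alpha))       # effective number of parameters
        E_W = 0.5 * np.sum(w ** 2)
        E_D = 0.5 * np.sum((X @ w - t) ** 2)
        alpha = gamma / (2 * E_W)
        beta = (N - gamma) / (2 * E_D)
    return w, alpha, beta

# Toy usage: beta should approach the true inverse noise variance (~1/0.3**2).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
t = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.3 * rng.normal(size=100)
w, alpha, beta = evidence_train(X, t)
print(w, alpha, beta)
```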
There are other variations of Bayesian neural nets and ARD, such as the Hybrid Monte Carlo algorithm (Neal, 55), that are beyond the scope of this paper. Nonetheless, most of the principles explained here apply to those other models as well.
ARD as a Meaning Extractor
ARD is meant to "determine which of many inputs to a neural network are relevant to prediction of its targets." (Neal, 113) As such, ARD provides not just a way to improve prediction but also insight into the nature of the network itself: the α_k values of the input nodes can be used to gauge their importance. These values can be extremely valuable. For example, a neural network trained to predict cancer from environmental variables is useful, but one from which a user can extract which environmental variables are driving the predictions is even better.
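As a sketch of how such an analysis might look, the snippet below ranks inputs by their inverse ARD values. The variable names echo the inputs in the study discussed next, but the α_k values are invented for illustration (in practice they would come from a trained network, e.g. via NETLAB-style evidence re-estimation):

```python
import numpy as np

# Hypothetical per-input ARD hyperparameters alpha_k. A small alpha_k
# leaves that input's weights loosely constrained, marking the input as
# relevant; a large alpha_k forces its weights toward zero.
input_names = ["democracy", "trade", "balance_of_power", "distance"]
alpha_k = np.array([0.02, 0.05, 0.08, 3.1])   # made-up values for illustration

relevance = 1.0 / alpha_k                     # the "inverse ARD values"
for name, r in sorted(zip(input_names, relevance), key=lambda p: -p[1]):
    print(f"{name:18s} relevance ~ {r:6.1f}")
```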
One of the more interesting uses of ARD as an analysis tool was by Lagazio and Marwala in "Assessing different Bayesian neural network models for militarized interstate dispute." Lagazio and Marwala studied data relating militarized disputes between pairs of countries to seven variables, including democracy in the countries, trade between the countries, the balance of (military) power between them, distance, contiguity, and whether either is a world power. With this data the authors trained two Bayesian neural networks using different optimization techniques. In both tests, the inverse ARD values were clearly greater for the democracy, trade, and balance of power variables. From these results, Lagazio and Marwala claimed that the experiment showed that the liberal variables (democracy and trade) were interrelated and were clear determinants of conflict, and thus that the liberal thesis (that democracy and trade prevent conflict) is correct. (p10)
However, there is a fatal flaw in Lagazio and Marwala's analysis: ARD only measures the influence of an input; it does not specify the direction of that influence. So while the claim that the liberal variables matter is a true one, any claim about how they matter is not supported by the data. Nonetheless, demonstrating that the liberal variables matter is an achievement in and of itself.
Conclusions
Bayesian neural nets and ARD are useful tools in a researcher's or model builder's belt. The ability to automatically correct for overfitting is powerful and can save considerable time on network design. A Bayesian neural net should definitely be considered when little is known about the relevance of the inputs or when the data is noisy.
Additionally, the use of ARD as a measure of input relevance expands the range of tasks for which neural nets are useful. By reporting back information about the importance of inputs, ARD allows neural nets to move beyond being mere predictors and to inform theory.