Making Neural Networks Bayesian

Deep neural networks trained with stochastic gradient descent are a fantastic machine learning method and have achieved best-in-class performance on many of the most challenging machine learning problems, notably image and speech recognition. But I have issues with them, mostly because they are not at all Bayesian (and I think almost everything should be). What we get is a MAP assignment to the parameters of a very flexible non-linear model, which works great so long as we have enough data to be confident that the MAP assignment generalizes to new data. But what if we need a very complex model and we don't have millions of training examples? It would be great to have something more Bayesian: a probability distribution over the parameters that we could use to predict a probability distribution over new data.

The good news is that we can; the bad news is that we also can't. I would be thrilled if someone can tell me otherwise, but I don't believe there is any way to directly obtain the joint distribution over the parameters. What we can do, however, is sample from it. The basic idea, known as the Metropolis-Hastings algorithm, is very simple and always works:

  1. Randomly initialize the parameters
  2. Take a random step in the parameter space (called sampling from the proposal distribution)
  3. Accept (update the parameters) or reject (revert to the original parameters) the step based on an acceptance probability
  4. Whether the proposal was accepted or rejected, record the current parameter set as a sample
  5. Repeat 2-4 forever

Here the acceptance probability comes from the loss function: we treat the loss as a negative log-likelihood and accept with probability

\min\left\{1, \frac{\pi_{new}}{\pi_{old}}\right\} = \min\left\{1, \exp(\mathcal{L}_{old} - \mathcal{L}_{new})\right\}
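
To make the loop concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python with NumPy. It is illustrative only: the `loss` function, the step size, and the toy Gaussian target are stand-ins, and in the neural network setting every call to `loss` would be a full pass over the training data.

    import numpy as np

    def metropolis_hastings(loss, dim, n_samples, step_size=0.1):
        # loss maps a parameter vector to a scalar treated as a
        # negative log-likelihood (a full pass over the data per call).
        rng = np.random.default_rng(0)
        theta = rng.normal(size=dim)                 # 1. random initialization
        current_loss = loss(theta)
        samples = []
        for _ in range(n_samples):
            proposal = theta + step_size * rng.normal(size=dim)  # 2. random step
            proposal_loss = loss(proposal)
            # 3. accept with probability min(1, exp(loss_old - loss_new))
            if rng.random() < np.exp(current_loss - proposal_loss):
                theta, current_loss = proposal, proposal_loss
            samples.append(theta.copy())             # 4. record a sample either way
        return np.array(samples)

    # Toy usage: sample a 2-d standard Gaussian, where loss = 0.5 * ||theta||^2.
    samples = metropolis_hastings(lambda t: 0.5 * t @ t, dim=2, n_samples=10_000)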

In the limit, as the number of samples goes to infinity, the distribution of the sampled parameters converges to the true posterior. Which sounds great; unfortunately, in a high-dimensional space like the parameter space of a neural network, the algorithm has some issues. The optimal step size gives an acceptance probability of around 23%. Bearing in mind that we need to go through the whole data set to get a single step, this is already looking pretty slow. Add to that the fact that the step size needed to hold a 23% acceptance rate shrinks rapidly as the number of dimensions grows, and it's obvious the 'dumb' Metropolis-Hastings approach is not going to get us there in any sort of reasonable time.

There are better approaches, notably Hamiltonian Monte Carlo, which does much better, especially in higher-dimensional spaces. The idea is to take advantage of some properties of Hamiltonian dynamics to move much faster through the space and with much higher acceptance rates. Neal uses the analogy of a puck on a non-flat icy surface, but since I think ice is flat we will consider a ball rolling on a non-flat surface; in either case the height of the surface is the value of the loss function.

  1. Randomly initialize the parameters
  2. Randomly initialize the speed of the ball (give it a whack)
  3. Let it roll for a set time period
  4. Record the ball's new location
  5. Repeat 2-4 forever

Looks amazing! We are flying around the parameter space and accepting all of the samples. Except it's not so fantastic, because to simulate our movement through the parameter space (our ball rolling on a surface) we need to discretise the motion into a sequence of steps, and for every step we need a full pass through the data to get the gradients that determine the next step, as in the sketch below. On top of that, the discretisation introduces an error, so we still need to check an acceptance probability. The optimum is now more like 65%, but since we need multiple passes through the data to get each step, it's not as great an improvement as we might have hoped.
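
For concreteness, here is a sketch of one discretised trajectory using the standard leapfrog integrator (the usual choice for HMC; the names here are mine, NumPy only). Note that `grad_loss` is the gradient of the loss over the full data set, which is exactly why each step of the simulation is expensive.

    import numpy as np

    def leapfrog(theta, momentum, grad_loss, step_size, n_steps):
        # Simulate the 'ball rolling on the loss surface' between samples.
        # grad_loss: gradient of the full-data loss (one pass per call).
        theta = theta.copy()
        momentum = momentum - 0.5 * step_size * grad_loss(theta)  # half kick
        for _ in range(n_steps - 1):
            theta = theta + step_size * momentum                  # drift
            momentum = momentum - step_size * grad_loss(theta)    # full kick
        theta = theta + step_size * momentum                      # final drift
        momentum = momentum - 0.5 * step_size * grad_loss(theta)  # final half kick
        return theta, momentum

The end point of the trajectory is then accepted or rejected with a Metropolis correction to undo the discretisation error, which is where the roughly 65% acceptance rate mentioned above comes from.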

And that is why you can’t come up with a good sampling algorithm for neural networks.

But then some clever people did. At the heart of the idea is the observation that the gradient of a mini-batch is an estimate of the gradient of the full data set, and assuming the mini-batch is reasonably large the Central Limit Theorem should hold, so the estimation error is approximately Gaussian. Going back to our ball rolling on a surface, the effect of using a mini-batch is like adding some random wind blowing the ball around the surface. I have seen three approaches so far; if you know of any others, let me know.

Stochastic Gradient Langevin Dynamics

Langevin dynamics are a special case of Hamiltonian dynamics where we resample the momentum after every step. The stochastic gradient approach uses the observation that if we resample the momentum from a normal distribution with variance equal to the learning rate, and gradually anneal the learning rate to zero, the injected noise from the momentum will dominate the sampling noise from using a mini-batch rather than the full gradient. The authors also show empirically that for small learning rates we do not need to worry about the annealing-to-zero part (which makes sense, since keeping the learning rate constant is the same as annealing it really, really slowly). This is a great leap forward (a sketch of the update rule follows the list below):

  • We get a sample every batch instead of running through all the data
  • We get to keep all the samples; acceptance is close enough to 1 that we can accept every step
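
The update itself is tiny. Here is a minimal sketch, assuming `grad_loss_minibatch` returns a mini-batch estimate of the gradient of the full-data loss (already rescaled by data set size over batch size):

    import numpy as np

    rng = np.random.default_rng(0)

    def sgld_step(theta, grad_loss_minibatch, learning_rate):
        # One SGLD update: an SGD step plus injected Gaussian noise
        # with variance equal to the learning rate. Every step is a sample.
        noise = rng.normal(scale=np.sqrt(learning_rate), size=theta.shape)
        return theta - 0.5 * learning_rate * grad_loss_minibatch(theta) + noise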

It's not all good news, though: we are still moving around the parameter space very slowly, since the learning rate has to be small and we have random-walk behaviour. So along comes a better idea.

Stochastic Gradient Hamiltonian Monte Carlo

This is where it starts to get really clever. We can't simply use Hamiltonian Monte Carlo dynamics, because the random wind will blow us off course. So the solution is to add more wind, a lot more wind, so that the randomness from the gradient sampling is small in comparison, and at the same time add friction in proportion to the wind. The friction removes energy at roughly the same rate the wind adds it, and we end up with, to a good approximation, Hamiltonian Monte Carlo, but now we can sample every batch. There is still a problem, however: for a given friction input we need to know how much total wind there is, including the wind from the batch sampling, and we don't know what that is. The authors approximated it with a sort of running average of the variance of the gradients across batches, but this is not easy, and it assumes that variance is more or less constant everywhere in the parameter space (which it probably isn't).
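
A minimal sketch of the update in the same NumPy style, assuming `alpha` is the friction term and `beta_hat` is the running estimate of the mini-batch gradient noise described above (estimating `beta_hat` is the hard part; setting it to zero and relying on a large `alpha` is a common simplification):

    import numpy as np

    rng = np.random.default_rng(0)

    def sghmc_step(theta, v, grad_loss_minibatch, learning_rate, alpha, beta_hat):
        # One SGHMC update: momentum with friction (alpha) plus enough
        # injected noise that the mini-batch noise (beta_hat) is dominated.
        noise = rng.normal(scale=np.sqrt(2 * (alpha - beta_hat) * learning_rate),
                           size=theta.shape)
        v = (1 - alpha) * v - learning_rate * grad_loss_minibatch(theta) + noise
        return theta + v, v

So finally we get to my favourite.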

Stochastic Gradient Nosé-Hoover Thermostat

Even the name is awesome. The idea here is that instead of keeping the friction constant and trying to work out how much wind to add, we add a constant amount of wind and adjust the friction to try to keep the temperature constant.

Hold on a second. What temperature? What are you talking about?

Great question. In fact the temperature was always with us; it wasn't worth mentioning up to now because it was a constant equal to 1. Our acceptance probability was really

\min\left\{1, \frac{\pi_{new}}{\pi_{old}}\right\} = \min\left\{1, \exp\left(\frac{1}{T}(\mathcal{L}_{old} - \mathcal{L}_{new})\right)\right\}

At high temperature the acceptance probability is always equal to 1 and we are moving randomly (sampling from a uniform distribution). You can think of our random wind as raising the temperature and the added friction as reducing it. So if we measure the temperature through the kinetic energy, we can add friction when we move faster and reduce it when we slow down, keeping the temperature roughly constant. This is the same thing SGHMC is trying to achieve, but simpler. I particularly like it as an idea because it compensates automatically for the fact that the entire data set is itself only a batch, so instead of trying to recreate the performance over the set we are trying to get at something more fundamental.
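
A minimal sketch of the update (following Ding et al.'s stochastic gradient thermostats paper; the names are again mine): `A` sets the fixed amount of injected noise (the constant wind), `xi` is the friction the thermostat adapts, and the last line nudges the friction whenever the kinetic energy per dimension drifts away from the target temperature of 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def sgnht_step(theta, p, xi, grad_loss_minibatch, h, A):
        # One SGNHT update: fixed injected noise A, adaptive friction xi
        # chosen to hold the kinetic temperature near 1.
        noise = rng.normal(scale=np.sqrt(2 * A * h), size=theta.shape)
        p = p - xi * p * h - h * grad_loss_minibatch(theta) + noise
        theta = theta + p * h
        xi = xi + (p @ p / theta.size - 1.0) * h   # kinetic energy vs T = 1
        return theta, p, xi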

Some Final Thoughts

This is all pretty amazing stuff: we now have three algorithms for sampling from the parameter space at every batch. Two of them take advantage of Hamiltonian dynamics to move around the space quickly, and all are reasonably easy to implement in a deep learning framework. The next post will show how to do this in mxnet for the SGNHT algorithm. All of this can and should be applied to any algorithm with available gradients, even approximate gradients.

I do think these algorithms are going to have a huge impact on deep learning. So much of the work done with deep neural networks has been about forcing them to generalise (e.g. CNNs, dropout), and their application has really been limited to settings with huge quantities of training data. The ability to quickly generate parameter samples changes the balance of requirements and should hopefully open up new areas of application and new model architectures.
