Back-propagation is a heuristic
What is Back-propagation doing?
Can Back-propagation guarantee to find an optimal solution?
Can Back-propagation fail?
Let us consider all of these questions.
Hill-climbing/descending
First, consider the
Back-propagation algorithm.
It is based on descending an
Error landscape.
We change the weights according to the slope
to reduce the error.
This is Hill-climbing/descending
(aka "gradient descent/ascent").
Notes:
- If slope is large then our rule causes us to make large jumps
in the direction that seems to be downhill.
- We do not know this will be downhill.
We do not see the whole landscape.
All we do is change the x value and hope the y value is smaller.
It may turn out to have been a jump uphill.
- As we approach a minimum, the slope must approach zero
and hence we make smaller jumps.
As we get closer to the minimum, the slope is even closer to zero
and so we make even smaller jumps.
Hence the system converges to the minimum.
Change in weights slows down to 0.
- If slope is zero we do not change the weight.
- As long as slope is not zero
we will keep changing w.
We will never stop until slope is zero.
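As a rough sketch in Python (the names descend and error_slope, the learning rate C and the toy landscape are all illustrative, not part of any particular implementation), the rule in the notes above looks like this:

def descend(w, error_slope, C=0.1, steps=100):
    # Step downhill repeatedly: a large slope gives a large jump,
    # a small slope gives a small jump, and a zero slope leaves w unchanged.
    for _ in range(steps):
        slope = error_slope(w)    # dE/dw at the current point (the only thing we can see)
        w = w - C * slope
    return w

# Toy landscape: E(w) = (w - 3)^2, so dE/dw = 2(w - 3); the minimum is at w = 3.
print(descend(w=0.0, error_slope=lambda w: 2 * (w - 3)))    # converges towards 3.0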
How do we know this won't get stuck in local minima?
Answer - we don't.
Multiple descents
- Remember this is just the landscape for this weight.
There are other landscapes for the other weights.
- Also, for this weight, this is just the landscape for this exemplar.
On the next time step we will be adjusting this weight
in relation to the error landscape for the next exemplar.
We mix up the exemplars.
Rather than single-mindedly descending one mountain,
we are trying to descend multiple mountains at once.
Different weights do eventually
end up specialised on different zones of the input space,
i.e. specialised on one particular descent
(or rather a family of similar descents).
But this takes time.
At the start, things are much more noisy.
See How does specialisation happen?
- Since all the other weights are constantly changing,
even when, next time round, the same exemplar is shown to the same weight,
the error landscape will be different.
- We don't actually have an equation for the landscape.
All we have is a single point on the x axis
and a corresponding value for E and the slope
(at this point only).
We have to make a decision based on this.
Next time round there will be a different landscape
and a different point.
What is also often done is that the learning rate C starts high
and declines as learning goes on.
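As a sketch of those two ideas together, per-exemplar updates over shuffled exemplars plus a declining learning rate (the function slope_for, the decay schedule and all constants are made up for illustration):

import random

def stochastic_descend(w, exemplars, slope_for, C=0.05, decay=0.99, epochs=50):
    for _ in range(epochs):
        random.shuffle(exemplars)            # mix up the exemplars
        for ex in exemplars:
            w = w - C * slope_for(w, ex)     # each exemplar presents a different landscape
        C = C * decay                        # learning rate starts high and declines
    return w

# Toy example: fit y = w*x; per-exemplar error E = (w*x - y)^2, so dE/dw = 2x(w*x - y).
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
print(stochastic_descend(0.0, data, lambda w, ex: 2 * ex[0] * (w * ex[0] - ex[1])))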
Let us consider why a "noisy" gradient descent
(large jumps, large weight changes, including possible jumps uphill)
- where the noise dies down as time goes on
(approach area of global minimum, smaller jumps,
smaller weight changes, settle in minimum)
- is often a better idea
than a strict downhill descent.
Consider a landscape like the following, with a shallow local minimum beside a deeper global minimum:
If the ball strictly moves downhill, and it starts at a random point,
it is more or less equally likely to end up in the local minimum
as in the global one.
If we shake the whole system, however, we are more likely to shake it from local
to global than vice versa, since the ball sits deeper in the global minimum
and needs a bigger shake to escape it.
If we shake violently, getting it out of minima becomes very probable,
but at the cost of making it more likely to actually shake it out of
the global minimum into a local one.
If we shake gently, getting it out of minima is harder
but when it does it is more likely to go from local to global.
So what we do is start by shaking violently,
and gradually shake less as time goes on.
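A toy illustration of that schedule (this is not back-prop itself; the landscape E and every constant here are made up): ordinary downhill steps plus random shaking whose size dies away over time.

import random

def E(w):
    # Made-up landscape: a shallow minimum near w = -1 and a deeper (global) one near w = 2.
    return 0.5 * (w + 1) ** 2 * (w - 2) ** 2 - 0.3 * w

def noisy_descent(w, steps=2000, C=0.01, noise0=1.5):
    for t in range(steps):
        slope = (E(w + 1e-5) - E(w - 1e-5)) / 2e-5    # numerical slope at this point only
        shake = noise0 * (1 - t / steps)              # shake violently at first, then gently
        w = w - C * slope + random.gauss(0, shake)
    return w

print(noisy_descent(w=-1.0))    # usually settles near the deeper minimum around w = 2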
It seems unsatisfactory that we have a method that cannot
guarantee to find the global minimum.
The problem comes from the blindness of our exploration.
Below is the actual landscape.
But we are never actually shown the overall shape of the landscape.
All we do is suggest points on the X-axis (weights),
and are told the point on the Y-axis.
It's like we have a torch that shows us just the point on which we are standing,
and everything else is in darkness.
The problem is we do not actually roll downhill (continuously).
Rather we make a finite number of jumps,
and we only ever explore a finite number of points.
We can repeat this for lots of points,
but we need an infinite number of points to be sure
that we have explored the whole curve.
If the curve is sufficiently fine-detailed
(see discussion of chaotic function)
then we are not guaranteed to find that global minimum
without an infinite amount of exploration:
X-axis - weight parameter.
Y-axis - error. Change weight parameter to find minimum error.
Hard to find minimum error unless the program can see the whole function (as a human can here).
If we explore by finite sampling of a few points, we may never find the pit.
Note the pit cannot have vertical sides.
To be a function (one y for each x)
its sides must have slope, no matter how steep.
Image thanks to
Victor Terron,
replacing my poorly hand-drawn version.
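The point can be made with a toy version of this picture (the landscape, the pit's position and the pit's width are all invented; a real pit would have steeply sloped walls rather than the step used here, which only keeps the sketch short):

def E(w):
    broad = (w - 1) ** 2                       # broad valley with its minimum at w = 1
    pit = -10 if 3.0 < w < 3.001 else 0        # very narrow pit: the true global minimum
    return broad + pit

samples = [i / 100 for i in range(501)]        # 501 evenly spaced guesses across [0, 5]
best = min(samples, key=E)
print(best, E(best))    # reports w = 1.0, E = 0.0; the pit near w = 3.0005 is never sampled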
Global optimum may be hard to find and hard to stay at
Note that if by chance we land in the pit,
on a point on one of its steep walls,
we will find a very steep slope,
and hence make a big change in w.
Hence we jump out of the pit!
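A toy illustration of that jump, with invented numbers (the pit here is steep but not vertical, as the note above requires):

def E(w):
    # Made-up landscape: a broad valley plus a narrow pit centred at w = 3,
    # whose walls have a finite but very large slope (about 20000).
    pit_depth = max(0.0, 10.0 - 20000.0 * abs(w - 3.0))
    return (w - 1) ** 2 - pit_depth

C = 0.1
w = 3.0003                                     # by chance, a point on the pit's steep wall
slope = (E(w + 1e-6) - E(w - 1e-6)) / 2e-6     # very steep slope at that point
print(w - C * slope)                           # about -1997: the update jumps far outside the pit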
- Should we then modify the rule so that for the steepest slopes we make a small w change instead of a large one?
- But you need a small w change for small slopes too,
otherwise it will never converge to slope 0.
So your algorithm makes a small w change for all slopes?
- The problem is that here, all things being otherwise equal,
high E means a high slope.
Hence we would make only a small change in w when E is high!
- Could you remember altitude?
If jump brings you to higher altitude, go back.
- Danger (if the jump range is restricted) of this being the same as a strict move downhill -
we end up in a local minimum.
We want to be able to jump over barriers to get to the global minimum.
- Could go back and try another jump, and repeatedly try jumps
(over the entire range?)
until we get lower.
In any case, we can make the mouth of the pit arbitrarily small,
reducing to near zero the probability of ever finding it,
no matter what method you use.
The point remains that no algorithm of finite sampling can ever guarantee
to find the global optimum.
Evolutionary algorithms are even more blind.
At least here, we have a measure of E
and a measure of the slope.
In a genetic algorithm we do not have this information.
But of course that is the point of a GA
- to be able to learn
despite the fact that we have
less information / feedback.
With single-layer networks, for a function that is linearly separable,
where we know that some configuration of the network
can separate the classes,
there can be a
Convergence Proof,
in the sense that we can prove that
the network will learn to separate the exemplars.
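A minimal sketch of the kind of single-layer rule that the convergence proof covers (the training loop, learning rate and epoch limit here are illustrative, not a particular historical implementation):

def train_perceptron(exemplars, n_inputs, rate=0.1, max_epochs=1000):
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in exemplars:                        # target is 0 or 1
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            delta = target - out                           # -1, 0 or +1
            if delta != 0:
                errors += 1
                w = [wi + rate * delta * xi for wi, xi in zip(w, x)]
                b = b + rate * delta
        if errors == 0:                                    # every exemplar correctly separated
            break
    return w, b

# AND is linearly separable, so this loop is guaranteed to stop with a separating line.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(AND, n_inputs=2))

On a problem that is not linearly separable (XOR, say) the same loop never reaches zero errors and simply runs out of epochs, which is the single-layer limitation discussed in the history below.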
For a multi-layer net, what does a
Convergence Proof mean?
That we can approximate the function f?
To what level of accuracy?
Anyway, which function?
A more complex function requires a larger network.
Maybe with the current network architecture
we can only represent f crudely.
Convergence Proof could prove that it converges to something,
i.e. stops changing.
But perhaps to a local optimum
rather than the global optimum.
Summary:
Back-prop is a heuristic.
It cannot guarantee to find the globally optimal solution.
The following reading explains the nature of back-propagation.
- Rumelhart et al - Defines back-prop as a heuristic.
- Kolen and Pollack - Considers what solution (of many) the heuristic will find.
- "Back Propagation is Sensitive to Initial Conditions".
Kolen, J. F. and Pollack, J. B. (1990).
Complex Systems, 4,3, 269-280.
- They sum it up:
"Rumelhart et al. made a confident statement: for many tasks,
"the network rarely gets
stuck in poor local minima that are significantly worse
than the global minima." ...
The convergence claim was based solely upon their empirical experience with the back
propagation technique. Since then, Minsky and Papert ... have argued that there exists no
proof of convergence for the technique, and several researchers ... have found that the convergence time must be related to the difficulty of the
problem, otherwise an unsolved computer science question (P=NP) would finally be answered.
We do not wish to make claims about convergence of the technique in the limit (with vanishing
step-size), or the relationship between task and performance, but wish to talk about a pervasive
behavior of the technique which has gone unnoticed for several years: the sensitivity of back
propagation to initial conditions."
- They point out that, as a heuristic,
it will find some local optimum.
Given the same training set,
which local optimum it finds
is decided entirely by the initial
weights.
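A toy, one-weight illustration of that dependence (plain descent on a made-up landscape with two minima; Kolen and Pollack's result concerns full multi-layer nets, but the flavour is the same):

def slope(w):
    return 4 * w * (w * w - 1)       # dE/dw for the made-up landscape E(w) = (w^2 - 1)^2

def descend_from(w, C=0.05, steps=200):
    for _ in range(steps):
        w = w - C * slope(w)
    return w

print(descend_from(+0.3))    # settles near w = +1
print(descend_from(-0.3))    # settles near w = -1: same rule, same landscape, different start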
Note on search for weights
- We cannot guarantee to find the global optimal set of weights.
- We could in theory ignore the heuristic
and search for the correct network
by doing an exhaustive search on all
sets of weights.
- But the search space is far too big (see the sketch after this note).
- The initial (random) weights and the back-prop heuristic
constrain our search
to make it practical.
But they do so by limiting
from the start the solutions
we will find.
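A back-of-the-envelope sketch of the size of that search space (both numbers are invented purely for illustration):

n_weights = 100                           # a modest multi-layer net
values_per_weight = 10                    # even a very coarse discretisation of each weight
print(values_per_weight ** n_weights)     # 10^100 weight settings: hopeless to enumerate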
- Pollack - Says a heuristic approach is normal.
- "No Harm Intended: A Review of the Perceptrons expanded edition".
Pollack, J. B. (1989).
Journal of Mathematical Psychology, 33(3), 358-365.
- He points out that this is just normal in AI
(to have a heuristic that cannot search the whole space):
"Minsky and Papert point out, rightly, that Rumelhart et al have no result,
like the
Perceptron Convergence Theorem ... that the generalized delta rule is
guaranteed to find solutions when they exist, and that "little has been proved about the
range of problems upon which [the generalized delta rule] works both efficiently and
dependably." ...
On the other hand, we also know that for certain classes of very hard problems,
there are no algorithms which scale significantly better than exhaustive search, unless
they are approximations. But there are many such impractical and approximate methods
which are in practical use worldwide. Despite the fact that the Traveling Salesman Problem
is NP-complete, salespersons do travel, and often optimally."
- 1950s-60s
- Learning rules for single-layer nets.
Also interesting to note that HLLs (high-level languages) were primitive or non-existent.
Computer models often focused on raw hardware/brain/network models.
- 1969
- Perceptrons, Minsky and Papert.
Limitations of single-layer nets apparent.
People (in fact, Minsky and Papert themselves!)
were aware that multi-layer nets
could perform more complex classifications.
That was not the problem.
The problem was the absence of a learning rule
for multi-layer nets.
Meanwhile, HLLs had been invented,
and people were excited about them,
seeing them as possibly the language of the mind.
Connectionism went into decline.
Explicit, symbolic approaches dominant in AI. Logic. HLLs.
- 1974
- Back-propagation (learning rule for multi-layer nets) invented
by Werbos,
but not noticed.
- 1986
- Back-prop popularised.
Connectionism takes off again.
Also HLLs no longer so exciting.
Seen as part of Computer Engineering, little to do with brain.
Increased interest in numeric and statistical models.