Back-propagation is a heuristic
What is Back-propagation doing?
Can Back-propagation guarantee to find an optimal solution?
Can Back-propagation fail?
Let us consider all of these questions.
Hill-climbing/descending
First, consider the
Back-propagation algorithm.
It is based on descending an
Error landscape.
We change the weights according to the slope
to reduce the error.
This is Hill-climbing/descending
(aka "gradient descent/ascent").
Notes:
- If slope is large then our rule causes us to make large jumps
in the direction that seems to be downhill.
- We do not know this will be downhill.
We do not see the whole landscape.
All we do is change the x value and hope the y value is smaller.
It may turn out to have been a jump uphill.
- As we approach a minimum, the slope must approach zero
and hence we make smaller jumps.
As we get closer to the minimum, the slope is even closer to zero
and so we make even smaller jumps.
Hence the system converges to the minimum.
Change in weights slows down to 0.
- If slope is zero we do not change the weight.
- As long as slope is not zero
we will keep changing w.
We will never stop until slope is zero.
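As a rough sketch in Python (the names descend and error_slope, the learning rate C and the toy landscape are all illustrative, not part of any particular implementation), the rule in the notes above looks like this:

def descend(w, error_slope, C=0.1, steps=100):
    # Step downhill repeatedly: a large slope gives a large jump,
    # a small slope gives a small jump, and a zero slope leaves w unchanged.
    for _ in range(steps):
        slope = error_slope(w)    # dE/dw at the current point (the only thing we can see)
        w = w - C * slope
    return w

# Toy landscape: E(w) = (w - 3)^2, so dE/dw = 2(w - 3); the minimum is at w = 3.
print(descend(w=0.0, error_slope=lambda w: 2 * (w - 3)))    # converges towards 3.0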
How do we know this won't get stuck in local minima?
Answer - we don't.
Multiple descents
- Remember this is just the landscape for this weight.
There are other landscapes for the other weights.
- Also, for this weight, this is just the landscape for this exemplar.
On the next time step we will be adjusting this weight
in relation to the error landscape for the next exemplar.
We mix up the exemplars.
Rather than single-mindedly descending one mountain,
we are trying to descend multiple mountains at once.
Different weights do eventually
end up specialised on different zones of the input space,
i.e. specialised on one particular descent
(or rather a family of similar descents).
But this takes time.
At the start, things are much more noisy.
See How does specialisation happen?
- Since all the other weights are constantly changing,
even when, next time round, the same exemplar is shown to the same weight,
the error landscape will be different.
- We don't actually have an equation for the landscape.
All we have is a single point on the x axis
and a corresponding value for E and the slope
(at this point only).
We have to make a decision based on this.
Next time round there will be a different landscape
and a different point.
What is also often done is that the learning rate C starts high
and declines as learning goes on.
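As a sketch of those two ideas together, per-exemplar updates over shuffled exemplars plus a declining learning rate (the function slope_for, the decay schedule and all constants are made up for illustration):

import random

def stochastic_descend(w, exemplars, slope_for, C=0.05, decay=0.99, epochs=50):
    for _ in range(epochs):
        random.shuffle(exemplars)            # mix up the exemplars
        for ex in exemplars:
            w = w - C * slope_for(w, ex)     # each exemplar presents a different landscape
        C = C * decay                        # learning rate starts high and declines
    return w

# Toy example: fit y = w*x; per-exemplar error E = (w*x - y)^2, so dE/dw = 2x(w*x - y).
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
print(stochastic_descend(0.0, data, lambda w, ex: 2 * ex[0] * (w * ex[0] - ex[1])))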
Let us consider why a "noisy" gradient descent
(large jumps, large weight changes, including possible jumps uphill)
- where the noise dies down as time goes on
(approach area of global minimum, smaller jumps,
smaller weight changes, settle in minimum)
- is often a better idea
than a strict downhill descent.
Consider a landscape like the following, with a shallow local minimum beside a deeper global minimum:
If the ball strictly moves downhill, and it starts at a random point,
it is more or less equally likely to end up in the local minimum
as in the global one.
If we shake the whole system, however, we are more likely to shake it from local
to global than vice versa, since the ball sits deeper in the global minimum
and needs a bigger shake to escape it.
If we shake violently, getting it out of minima becomes very probable,
but at the cost of making it more likely to actually shake it out of
the global minimum into a local one.
If we shake gently, getting it out of minima is harder
but when it does it is more likely to go from local to global.
So what we do is start by shaking violently,
and gradually shake less as time goes on.
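A toy illustration of that schedule (this is not back-prop itself; the landscape E and every constant here are made up): ordinary downhill steps plus random shaking whose size dies away over time.

import random

def E(w):
    # Made-up landscape: a shallow minimum near w = -1 and a deeper (global) one near w = 2.
    return 0.5 * (w + 1) ** 2 * (w - 2) ** 2 - 0.3 * w

def noisy_descent(w, steps=2000, C=0.01, noise0=1.5):
    for t in range(steps):
        slope = (E(w + 1e-5) - E(w - 1e-5)) / 2e-5    # numerical slope at this point only
        shake = noise0 * (1 - t / steps)              # shake violently at first, then gently
        w = w - C * slope + random.gauss(0, shake)
    return w

print(noisy_descent(w=-1.0))    # usually settles near the deeper minimum around w = 2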
It seems unsatisfactory that we have a method that cannot
guarantee to find the global minimum.
The problem comes from the blindness of our exploration.
Below is the actual landscape.
But we are never actually shown the overall shape of the landscape.
All we do is suggest points on the X-axis (weights),
and are told the point on the Y-axis.
It's like we have a torch that shows us just the point on which we are standing,
and everything else is in darkness.
The problem is we do not actually roll downhill (continuously).
Rather we make a finite number of jumps,
and we only ever explore a finite number of points.
We can repeat this for lots of points,
but we need an infinite number of points to be sure
that we have explored the whole curve.
If the curve is sufficiently fine-detailed
(see discussion of chaotic function)
then we are not guaranteed to find that global minimum
without an infinite amount of exploration:
X-axis - weight parameter.
Y-axis - error. Change weight parameter to find minimum error.
Hard to find minimum error unless the program can see the whole function (as a human can here).
If we explore by finite sampling of a few points, we may never find the pit.
Note the pit cannot have vertical sides.
To be a function (one y for each x)
its sides must have slope, no matter how steep.
Image thanks to
Victor Terron,
replacing my poorly hand-drawn version.
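The point can be made with a toy version of this picture (the landscape, the pit's position and the pit's width are all invented; a real pit would have steeply sloped walls rather than the step used here, which only keeps the sketch short):

def E(w):
    broad = (w - 1) ** 2                       # broad valley with its minimum at w = 1
    pit = -10 if 3.0 < w < 3.001 else 0        # very narrow pit: the true global minimum
    return broad + pit

samples = [i / 100 for i in range(501)]        # 501 evenly spaced guesses across [0, 5]
best = min(samples, key=E)
print(best, E(best))    # reports w = 1.0, E = 0.0; the pit near w = 3.0005 is never sampled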
Global optimum may be hard to find and hard to stay at
Note that if by chance we land in the pit,
on a point on one of its steep walls,
we will find a very steep slope,
and hence make a big change in w.
Hence we jump out of the pit!
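A toy illustration of that jump, with invented numbers (the pit here is steep but not vertical, as the note above requires):

def E(w):
    # Made-up landscape: a broad valley plus a narrow pit centred at w = 3,
    # whose walls have a finite but very large slope (about 20000).
    pit_depth = max(0.0, 10.0 - 20000.0 * abs(w - 3.0))
    return (w - 1) ** 2 - pit_depth

C = 0.1
w = 3.0003                                     # by chance, a point on the pit's steep wall
slope = (E(w + 1e-6) - E(w - 1e-6)) / 2e-6     # very steep slope at that point
print(w - C * slope)                           # about -1997: the update jumps far outside the pit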
- Should we then modify the rule so that for the steepest slopes we make a small w change instead of a large one?
- But you need a small w change for small slopes too,
otherwise it will never converge to slope 0.
So your algorithm makes a small w change for all slopes?
- The problem is that here, all things being otherwise equal,
high E means a high slope.
Hence we would make only a small change in w when E is high!
- Could you remember altitude?
If jump brings you to higher altitude, go back.
- Danger (if the jump range is restricted) of this being the same as a strict move downhill -
we end up in a local minimum.
We want to be able to jump over barriers to get to the global minimum.
- Could go back and try another jump, and repeatedly try jumps
(over the entire range?)
until we get lower.
In any case, we can make the mouth of the pit arbitrarily small,
reducing to near zero the probability of ever finding it,
no matter what method you use.
The point remains that no algorithm of finite sampling can ever guarantee
to find the global optimum.
Evolutionary algorithms are even more blind.
At least here, we have a measure of E
and a measure of the slope.
In a genetic algorithm we do not have this information.
But of course that is the point of a GA
- to be able to learn
despite the fact that we have
less information / feedback.
With single-layer networks, for a function that is linearly separable,
where we know that some configuration of the network
can separate the classes,
there can be a
Convergence Proof,
in the sense that we can prove that
the network will learn to separate the exemplars.
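A minimal sketch of the kind of single-layer rule that the convergence proof covers (the training loop, learning rate and epoch limit here are illustrative, not a particular historical implementation):

def train_perceptron(exemplars, n_inputs, rate=0.1, max_epochs=1000):
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in exemplars:                        # target is 0 or 1
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            delta = target - out                           # -1, 0 or +1
            if delta != 0:
                errors += 1
                w = [wi + rate * delta * xi for wi, xi in zip(w, x)]
                b = b + rate * delta
        if errors == 0:                                    # every exemplar correctly separated
            break
    return w, b

# AND is linearly separable, so this loop is guaranteed to stop with a separating line.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(AND, n_inputs=2))

On a problem that is not linearly separable (XOR, say) the same loop never reaches zero errors and simply runs out of epochs, which is the single-layer limitation discussed in the history below.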
For a multi-layer net, what does a
Convergence Proof mean?
That we can approximate the function f?
To what level of accuracy?
Anyway, which function?
A more complex function requires a larger network.
Maybe with the current network architecture
we can only represent f crudely.
Convergence Proof could prove that it converges to something,
i.e. stops changing.
But perhaps to a local optimum
rather than the global optimum.
Summary:
Back-prop is a heuristic.
It cannot guarantee to find the globally optimal solution.
The following reading explains the nature of back-propagation.
- Rumelhart et al - Defines back-prop as a heuristic.
- Kolen and Pollack - Considers what solution (of many) the heuristic will find.
- "Back Propagation is Sensitive to Initial Conditions".
Kolen, J. F. and Pollack, J. B. (1990).
Complex Systems, 4,3, 269-280.
- They sum it up:
"Rumelhart et al. made a confident statement: for many tasks,
"the network rarely gets
stuck in poor local minima that are significantly worse
than the global minima." ...
The convergence claim was based solely upon their empirical experience with the back
propagation technique. Since then, Minsky and Papert ... have argued that there exists no
proof of convergence for the technique, and several researchers ... have found that the convergence time must be related to the difficulty of the
problem, otherwise an unsolved computer science question (P=NP) would finally be answered.
We do not wish to make claims about convergence of the technique in the limit (with vanishing
step-size), or the relationship between task and performance, but wish to talk about a pervasive
behavior of the technique which has gone unnoticed for several years: the sensitivity of back
propagation to initial conditions."
- They point out that, as a heuristic,
it will find some local optimum.
Given the same training set,
which local optimum it finds
is decided entirely by the initial
weights.
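A toy, one-weight illustration of that dependence (plain descent on a made-up landscape with two minima; Kolen and Pollack's result concerns full multi-layer nets, but the flavour is the same):

def slope(w):
    return 4 * w * (w * w - 1)       # dE/dw for the made-up landscape E(w) = (w^2 - 1)^2

def descend_from(w, C=0.05, steps=200):
    for _ in range(steps):
        w = w - C * slope(w)
    return w

print(descend_from(+0.3))    # settles near w = +1
print(descend_from(-0.3))    # settles near w = -1: same rule, same landscape, different start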
Note on search for weights
- We cannot guarantee to find the global optimal set of weights.
- We could in theory ignore the heuristic
and search for the correct network
by doing an exhaustive search on all
sets of weights.
- But the search space is far too big (see the sketch after this note).
- The initial (random) weights and the back-prop heuristic
constrain our search
to make it practical.
But they do so by limiting
from the start the solutions
we will find.
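A back-of-the-envelope sketch of the size of that search space (both numbers are invented purely for illustration):

n_weights = 100                           # a modest multi-layer net
values_per_weight = 10                    # even a very coarse discretisation of each weight
print(values_per_weight ** n_weights)     # 10^100 weight settings: hopeless to enumerate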
- Pollack - Says a heuristic approach is normal.
- "No Harm Intended: A Review of the Perceptrons expanded edition".
Pollack, J. B. (1989).
Journal of Mathematical Psychology, 33(3), 358-365.
- He points out that this is just normal in AI
(to have a heuristic that cannot search the whole space):
"Minsky and Papert point out, rightly, that Rumelhart et al have no result,
like the
Perceptron Convergence Theorem ... that the generalized delta rule is
guaranteed to find solutions when they exist, and that "little has been proved about the
range of problems upon which [the generalized delta rule] works both efficiently and
dependably." ...
On the other hand, we also know that for certain classes of very hard problems,
there are no algorithms which scale significantly better than exhaustive search, unless
they are approximations. But there are many such impractical and approximate methods
which are in practical use worldwide. Despite the fact that the Traveling Salesman Problem
is NP-complete, salespersons do travel, and often optimally."
- 1950s-60s
- Learning rules for single-layer nets.
Also interesting to note that HLLs (high-level languages) were primitive or non-existent.
Computer models often focused on raw hardware/brain/network models.
- 1969
- Perceptrons, Minsky and Papert.
Limitations of single-layer nets apparent.
People (in fact, Minsky and Papert themselves!)
were aware that multi-layer nets
could perform more complex classifications.
That was not the problem.
The problem was the absence of a learning rule
for multi-layer nets.
Meanwhile, HLLs had been invented,
and people were excited about them,
seeing them as possibly the language of the mind.
Connectionism went into decline.
Explicit, symbolic approaches dominant in AI. Logic. HLLs.
- 1974
- Back-propagation (learning rule for multi-layer nets) invented
by Werbos,
but not noticed.
- 1986
- Back-prop popularised.
Connectionism takes off again.
Also HLLs no longer so exciting.
Seen as part of Computer Engineering, little to do with brain.
Increased interest in numeric and statistical models.