Dr. Mark Humphrys

School of Computing. Dublin City University.



Back-propagation is a heuristic

What is Back-propagation doing?
Can Back-propagation guarantee to find an optimal solution?
Can Back-propagation fail?




Hill-climbing/descending

First, consider the Back-propagation algorithm.

It is based on descending an Error landscape. We change each weight according to the slope of the error with respect to that weight, so as to reduce the error.

This is Hill-climbing/descending (aka "gradient descent/ascent").

  
Notes (a code sketch of the update rule follows this list):
  1. If slope is large then our rule causes us to make large jumps in the direction that seems to be downhill.

  2. We do not know this will be downhill. We do not see the whole landscape. All we do is change the x value and hope the y value is smaller. It may turn out to have been a jump uphill.

  3. As we approach a minimum, the slope must approach zero and hence we make smaller jumps. As we get closer to the minimum, the slope is even closer to zero and so we make even smaller jumps. Hence the system converges to the minimum. Change in weights slows down to 0.

  4. If slope is zero we do not change the weight.

  5. As long as slope is not zero we will keep changing w. We will never stop until slope is zero.
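
A minimal sketch of this update rule, descending a one-dimensional error landscape. The landscape E, its slope, the learning rate C, and the starting weight are all illustrative assumptions, not part of back-prop itself:

  # Step the weight in proportion to the negative slope.
  def E(w):
      return (w - 3.0) ** 2 + 1.0       # invented error landscape, minimum at w = 3

  def dE_dw(w):
      return 2.0 * (w - 3.0)            # slope at the current point only

  C = 0.1                               # learning rate
  w = -5.0                              # starting weight

  for step in range(60):
      slope = dE_dw(w)
      w = w - C * slope                 # big slope => big jump; zero slope => no change

  print(w, E(w))                        # w converges towards 3.0 as the slope shrinks to 0

The sketch shows points 3 to 5 above: the jumps shrink as the slope approaches zero, and the weight stops changing only when the slope is zero.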

  


Multiple descents

  1. Remember this is just the landscape for this weight. There are other landscapes for the other weights.

  2. Also, for this weight, this is just the landscape for this exemplar. On the next time step we will be adjusting this weight in relation to the error landscape for the next exemplar. We mix up the exemplars. Rather than single-mindedly descending one mountain, we are trying to descend multiple mountains at once. Different weights do eventually end up specialised on different zones of the input space, i.e. specialised on one particular descent (or rather a family of similar descents). But this takes time. At the start, things are much more noisy. See How does specialisation happen?

  3. Since all the other weights are constantly changing, even when the same exemplar is next shown, the error landscape for this weight will be different.

  4. We don't actually have an equation for the landscape. All we have is a single point on the x axis and a corresponding value for E and the slope (at this point only). We have to make a decision based on this. Next time round there will be a different landscape and a different point.

It is also common for the learning rate C to start high and decline as learning goes on.
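
A hedged sketch of these mixed descents with a declining learning rate. The single-weight model, the exemplars, and the decay schedule are illustrative assumptions:

  import random

  # Each exemplar (x, target) defines its own error landscape over w.
  exemplars = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

  w = random.uniform(-1.0, 1.0)         # initial (random) weight
  C = 0.05                              # learning rate C: starts high ...

  for epoch in range(200):
      random.shuffle(exemplars)         # mix up the exemplars
      for x, target in exemplars:
          error = w * x - target        # output minus target, for THIS exemplar
          slope = 2.0 * error * x       # slope of E = error^2 w.r.t. w, at this point only
          w = w - C * slope             # one step down THIS exemplar's landscape
      C = C * 0.99                      # ... and declines as learning goes on

  print(w)                              # w settles near 2.0, a compromise across exemplars

Each step descends a different exemplar's landscape, so w never follows a single mountain; it settles where the competing descents roughly balance.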




No convergence proof

With single-layer networks, for a function that is linearly separable, where we know that some configuration of the network can separate the classes, there can be a Convergence Proof, in the sense that we can prove that the network will learn to separate the exemplars.
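
For concreteness, a minimal sketch of that single-layer case: the classic perceptron learning rule, which provably halts on linearly separable data (the Perceptron Convergence Theorem). The AND-like exemplars are an illustrative assumption:

  # Update the weights only on misclassified exemplars.
  exemplars = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]

  w = [0.0, 0.0]                        # weights
  b = 0.0                               # bias

  converged = False
  while not converged:
      converged = True                  # halt after a full error-free pass
      for (x1, x2), target in exemplars:
          out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
          if out != target:             # misclassified: nudge the boundary
              w[0] += target * x1
              w[1] += target * x2
              b += target
              converged = False

  print(w, b)                           # prints [2.0, 1.0] -2.0: the separating line 2*x1 + x2 = 2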

For a multi-layer net, what does a Convergence Proof mean? That we can approximate the function f? To what level of accuracy? Anyway, which function? A more complex function requires a larger network. Maybe with the current network architecture we can only represent f crudely.

A Convergence Proof might only prove that the network converges to something, i.e. stops changing. But perhaps to a local optimum rather than the global optimum.

Summary: Back-prop is a heuristic. It cannot guarantee to find the globally optimal solution.

  


Reading

The following reading explains the nature of back-propagation.
  
  1. Rumelhart et al - Defines back-prop as a heuristic.

      
    
  2. Kolen and Pollack - Considers what solution (of many) the heuristic will find.
    • "Back Propagation is Sensitive to Initial Conditions". Kolen, J. F. and Pollack, J. B. (1990). Complex Systems, 4,3, 269-280.
    • They sum it up: "Rumelhart et al. made a confident statement: for many tasks, "the network rarely gets stuck in poor local minima that are significantly worse than the global minima." ... The convergence claim was based solely upon their empirical experience with the back propagation technique. Since then, Minsky and Papert ... have argued that there exists no proof of convergence for the technique, and several researchers ... have found that the convergence time must be related to the difficulty of the problem, otherwise an unsolved computer science question (P=NP) would finally be answered. We do not wish to make claims about convergence of the technique in the limit (with vanishing step-size), or the relationship between task and performance, but wish to talk about a pervasive behavior of the technique which has gone unnoticed for several years: the sensitivity of back propagation to initial conditions."
    • They point out that, as a heuristic, it will find some local optimum. Given the same training set, which local optimum it finds is decided entirely by the initial weights. (A sketch of this sensitivity follows the reading list below.)

      Note on search for weights

      • We cannot guarantee to find the global optimal set of weights.
      • We could in theory ignore back-propagation and search for the correct network by doing an exhaustive search over all possible sets of weights.
      • But the search space is far too big. For example, even quantising each of just 100 weights to 10 possible values gives 10^100 configurations.
      • The initial (random) weights constrain our search and make it practical. But they do so by limiting from the start the solutions we will find.
      
    
  3. Pollack - Says a heuristic approach is normal.
    • "No Harm Intended: A Review of the Perceptrons expanded edition". Pollack, J. B. (1989). Journal of Mathematical Psychology, 33, 3, 358-365.
    • He points out that this is just normal in AI (to have a heuristic that cannot search the whole space): "Minsky and Papert point out, rightly, that Rumelhart et al have no result, like the Perceptron Convergence Theorem ... that the generalized delta rule is guaranteed to find solutions when they exist, and that "little has been proved about the range of problems upon which [the generalized delta rule] works both efficiently and dependably." ... On the other hand, we also know that for certain classes of very hard problems, there are no algorithms which scale significantly better than exhaustive search, unless they are approximations. But there are many such impractical and approximate methods which are in practical use worldwide. Despite the fact that the Traveling Salesman Problem is NP-complete, salespersons do travel, and often optimally."
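
To illustrate Kolen and Pollack's point, a hedged sketch (not from their paper): gradient descent on one invented landscape with two minima of different depths. Only the starting weight differs between the two runs:

  # E(w) = w^4 - 2*w^2 + 0.5*w has a deep minimum and a shallow one.
  def dE_dw(w):
      return 4 * w**3 - 4 * w + 0.5     # slope of the landscape

  def descend(w, C=0.01, steps=2000):
      for _ in range(steps):
          w = w - C * dE_dw(w)          # same rule, same landscape every run
      return w

  print(descend(-1.5))                  # starts left:  ends near w = -1.06 (global minimum)
  print(descend(+1.5))                  # starts right: ends near w = +0.93 (worse local minimum)

Same training landscape, same rule; the initial weight alone decides which optimum is found.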
  


Timeline

1950s-60s
Learning rules for single-layer nets.

Also interesting to note that high-level languages (HLLs) were primitive or non-existent. Computer models often focused on raw hardware/brain/network models.

1969
Perceptrons, Minsky and Papert. Limitations of single-layer nets apparent.
People (in fact, Minsky and Papert themselves!) were aware that multi-layer nets could perform more complex classifications. That was not the problem. The problem was the absence of a learning rule for multi-layer nets.

Meanwhile, HLLs had been invented, and people were excited about them, seeing them as possibly the language of the mind. Connectionism went into decline. Explicit, symbolic approaches dominant in AI. Logic. HLLs.

1974
Back-propagation (a learning rule for multi-layer nets) invented by Werbos, but not noticed.

1986
Back-prop popularised. Connectionism takes off again.

Also HLLs no longer so exciting. Seen as part of Computer Engineering, little to do with brain. Increased interest in numeric and statistical models.


