Dr. Mark Humphrys

School of Computing. Dublin City University.

The "Back-propagation" learning algorithm

We will now define a learning algorithm for Multi-layer Neural Networks.

We assume the network will use the Sigmoid activation function.

  


Notation

  

(Figure: a multi-layer network with inputs Ii, hidden nodes j, and output nodes k. Note: Not a great drawing. Can actually have multiple output nodes.)

Each hidden node computes   xj = Σ i wij Ii   and outputs   yj = σ( xj - tj ).
Each output node computes   xk = Σ j wjk yj   and outputs   yk = σ( xk - tk ).
where σ is the sigmoid function:   σ(x) = 1 / (1 + e^(-x))

Input can be a vector.
There may be any number of hidden nodes.
Output can be a vector too.

Typically fully-connected. But remember that if a weight becomes zero, then that connection may as well not exist. The learning algorithm may learn to set some of the connection weights to zero. i.e. We start fully-connected, and the learning algorithm learns to drop some connections.

To be precise, by making some of its input weights wij zero or near-zero, a hidden node decides to specialise only on certain inputs. The hidden node is then said to "represent" this set of inputs.




Remove thresholds

First, we are going to get rid of the thresholds. (This will be explained later.)

So we have:

yj = σ( xj ) = σ( Σ i wij Ii )

and:

yk = σ( xk ) = σ( Σ j wjk yj )
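As a concrete illustration, this forward pass (thresholds removed) can be sketched in Python. This is only a sketch: the 2-2-1 architecture, the inputs and all the weight values below are made up for illustration.

```python
import math

def sigmoid(x):
    # the sigmoid function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

# made-up example: 2 inputs, 2 hidden nodes, 1 output
I = [0.5, 0.9]                      # inputs Ii
wij = [[0.4, -0.2], [0.1, 0.6]]     # wij[i][j]: input i -> hidden j
wjk = [[0.3], [-0.5]]               # wjk[j][k]: hidden j -> output k

# hidden layer: xj = sum_i wij Ii, then yj = sigmoid(xj)
yj = [sigmoid(sum(I[i] * wij[i][j] for i in range(2))) for j in range(2)]

# output layer: xk = sum_j wjk yj, then yk = sigmoid(xk)
yk = [sigmoid(sum(yj[j] * wjk[j][k] for j in range(2))) for k in range(1)]

print(yj, yk)
```

Note the output is always strictly between 0 and 1, since it comes out of a sigmoid.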




Error term

We sent in an input, and it generated, in the output nodes, a vector of outputs yk.
The correct answer is the vector of numbers Ok.
The error term is:

E = 1/2 Σ k ( yk - Ok )²

We take the squares of errors, otherwise positive and negative errors may cancel each other out.

Q. Show example of where error term fails if we don't take the squares.

There are other possible measures of error (recall Distance in n-dimensions) but we can agree that if this measure   -> 0   then all other measures of error   -> 0  

Q. Prove that if this measure of E = 0 then yk = Ok for all k.
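The cancellation problem can be seen concretely in a small Python check (made-up numbers): without squaring, errors of +0.5 and -0.5 sum to roughly zero even though both outputs are wrong.

```python
yk = [0.9, 0.1]   # actual outputs
Ok = [0.4, 0.6]   # correct outputs

# signed errors cancel: (0.9 - 0.4) + (0.1 - 0.6) = 0
signed_error = sum(y - o for y, o in zip(yk, Ok))

# squared error does not cancel: E = 1/2 sum (yk - Ok)^2
E = 0.5 * sum((y - o) ** 2 for y, o in zip(yk, Ok))

print(signed_error, E)
```

The signed sum suggests the network is perfect; the squared error correctly reports that it is not.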




The Learning algorithm - Back-propagation

To look at how to reduce the error, we look at how the error changes as we change the weights. We start at the layer immediately before the output. Working out the effects of earlier layers will be more complex.

First we can write total error as a sum of the errors at each node k:

E = Σ k   Ek
where Ek = 1/2 ( yk - Ok )²

Now note that yk, xk and wjk each only affect the error at one particular output node k (they only affect Ek).
So from the point of view of these 3 variables, total error:

E = (a constant) + (error at node k)
hence:
(derivative of total error E with respect to any of these 3 variables) = 0 + (derivative of error at node k)
e.g.
∂E/∂yk = 0 + ∂Ek/∂yk
We can see how the error changes as yk changes, or as xk changes. But note we can't change yk or xk - at least not directly. They follow in a predetermined way from the previous inputs and weights.

But we can change wjk


partial derivatives

E is a function of many variables.
We will be repeatedly holding all except one constant and seeing how E changes as just one variable is changed.
This is what we mean by using partial derivatives.


Derivatives of E - output layer:

∂E/∂yk   =   ( yk - Ok )
∂E/∂xk   =   ∂E/∂yk   ∂yk/∂xk   =   ( yk - Ok )  yk ( 1 - yk )       (using σ'(x) = σ(x) (1 - σ(x)))
∂E/∂wjk   =   ∂E/∂xk   ∂xk/∂wjk   =   ( yk - Ok )  yk ( 1 - yk )  yj


As we work backwards, the situation changes. yj feeds forward into all of the output nodes. Since:

E = (sum of errors at k)
we get:
(derivative of E) = (sum of derivatives of error at k)
xj and wij then only affect yj (though yj affects many things).

We can't (directly) change yj or xj

But we can change wij



Derivatives of E - previous layer:

∂E/∂yj   =   Σ k   ∂E/∂xk   wjk
∂E/∂xj   =   ∂E/∂yj   yj ( 1 - yj )
∂E/∂wij   =   ∂E/∂xj   Ii

To spell it out:
∂E/∂yj   = Σ k   ∂Ek/∂yj
= Σ k     ∂Ek/∂xk     ∂xk/∂yj
= Σ k     ∂E/∂xk     ∂xk/∂yj
= Σ k     ∂E/∂xk     wjk       (since xk = Σ j wjk yj, so ∂xk/∂yj = wjk)
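These chain-rule formulas can be checked numerically: compute ∂E/∂wij analytically, then compare with the finite-difference estimate (E(w+h) - E(w-h)) / 2h. A sketch, with a made-up 2-2-1 network and made-up weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

I = [0.5, 0.9]                       # made-up inputs
wij = [[0.4, -0.2], [0.1, 0.6]]      # made-up weights, input -> hidden
wjk = [[0.3], [-0.5]]                # made-up weights, hidden -> output
O = [1.0]                            # made-up correct output

def forward(wij, wjk):
    yj = [sigmoid(sum(I[i] * wij[i][j] for i in range(2))) for j in range(2)]
    yk = [sigmoid(sum(yj[j] * wjk[j][k] for j in range(2))) for k in range(1)]
    return yj, yk

def error(wij, wjk):
    _, yk = forward(wij, wjk)
    return 0.5 * sum((yk[k] - O[k]) ** 2 for k in range(1))

# analytic gradient for wij[0][0] via the chain rule
yj, yk = forward(wij, wjk)
dE_dxk = [yk[k] * (1 - yk[k]) * (yk[k] - O[k]) for k in range(1)]
dE_dyj0 = sum(dE_dxk[k] * wjk[0][k] for k in range(1))   # dE/dyj = sum_k dE/dxk wjk
dE_dxj0 = dE_dyj0 * yj[0] * (1 - yj[0])                  # dE/dxj = dE/dyj yj(1-yj)
analytic = dE_dxj0 * I[0]                                # dE/dwij = dE/dxj Ii

# finite-difference check on the same weight
h = 1e-5
wij[0][0] += h;     E_plus = error(wij, wjk)
wij[0][0] -= 2 * h; E_minus = error(wij, wjk)
wij[0][0] += h                                           # restore
numeric = (E_plus - E_minus) / (2 * h)

print(analytic, numeric)
```

The two numbers agree to many decimal places, which is good evidence the derivation is right.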



Error landscape: Changing the weights to reduce the error

Now we have an equation for each ∂E/∂w - how the error changes as you change the weight.

Note some things:

  1.   E >= 0   - it can't be negative.
  2. So we can't have the line just going down forever. It must level out, with slope = 0 at some point (perhaps many points).
  3. In fact, might not be able to get E = 0 for all exemplars given limited net architecture, so best E might still be   > 0.
  4. W can be negative. In fact, best W might be.
  5. As   W -> infinity   (or minus infinity) error almost certainly gets very bad. Best W will be something finite.

Now, to reduce error:

  1. On RHS, slope is positive:   ∂E/∂W > 0.
    Move left to reduce error:
    W := W - C ∂E/∂W
    where C is a positive constant.

  2. On LHS, slope is negative:   ∂E/∂W < 0.
    Move right to reduce error:
    W := W - C ∂E/∂W
      = W - C (negative quantity)
      = W + (positive quantity)


Hence the same update rule works for both positive and negative slopes:

W := W - C ∂E/∂W

The constant C is the LEARNING RATE  
C > 0
Typically   C < 1
(In the code we can try a wide range of C and see what happens.)
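The update rule W := W - C ∂E/∂W can be tried on a one-variable error function. A sketch (not a network - a made-up example with E(W) = (W - 2)², whose best W is 2) trying a few values of C:

```python
def dE_dW(W):
    # derivative of the made-up error E(W) = (W - 2)^2
    return 2.0 * (W - 2.0)

for C in [0.01, 0.1, 0.5]:           # try a range of learning rates
    W = -3.0                         # start far from the best W
    for _ in range(1000):
        W = W - C * dE_dW(W)         # W := W - C dE/dW
    print(C, W)
```

All three learning rates converge here, small C just takes more steps. (Too large a C, e.g. C > 1 on this E, would overshoot and diverge.)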

   

 

That's how we learn the weights. How do we learn thresholds?

Remember definition of threshold if using sigmoid function.
sigmoid(x-t) instead of sigmoid(x)

Biasing: Just make the thresholds into weights, on a link with constant input -1.

So instead of:

y = σ( Σ i wi Ii - t )

we change it to:

y = σ( Σ i wi Ii + t·(-1) )

and we just learn the thresholds like any other weights.

i.e. Every single hidden unit (and every single output unit) has an extra input line coming apparently from nowhere with constant input -1.
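The trick can be verified directly in Python: sigmoid(x - t) equals the sigmoid of the augmented sum in which t is just another weight on an extra constant input of -1 (all numbers made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w = [0.4, 0.7]      # ordinary weights
inp = [0.2, 0.5]    # ordinary inputs
t = 0.3             # threshold

# original form: sigmoid( sum wi Ii - t )
a = sigmoid(sum(wi * x for wi, x in zip(w, inp)) - t)

# bias form: threshold becomes a weight on a constant extra input -1
w2 = w + [t]
inp2 = inp + [-1.0]
b = sigmoid(sum(wi * x for wi, x in zip(w2, inp2)))

print(a, b)
```

The two forms give the same output, so the learning rule for ordinary weights learns thresholds too.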




We need thresholds

Remember we need thresholds, otherwise the sigmoid function is centred on zero. e.g. If no thresholds, then, no matter what the exemplars are, no matter what the I/O relationship is, and no matter what the weights are, if all inputs are 0, then output of every single hidden node is ..

Q. Is what?

Similarly, for a node that specialises on n of the inputs (weight = 0 for others), then if that subset of inputs are all 0, that node's output must be ..





Summary: The Back-propagation algorithm

We can now put everything together. The algorithm is:
  
Repeat:
  1. Send in inputs Ii for this exemplar.
  2. Calculate outputs yk
  3. Get given correct outputs Ok
  4. Measure E.

  5. wjk weights:
    1. Calculate all the   δk   =   yk ( 1 - yk ) ( yk - Ok )
    2. Calculate all the   ∂E/∂wjk   =   δk yj
    3. For all j,k:   wjk := wjk - C ∂E/∂wjk

  6. wij weights:
    1. Calculate all the   δj   =   yj ( 1 - yj )   Σ k ( δk wjk )
    2. Calculate all the   ∂E/∂wij   =   δj Ii
    3. For all i,j:   wij := wij - C ∂E/∂wij

  7. Repeat (next exemplar).
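Putting the loop together, here is a minimal sketch of the algorithm in Python. Everything concrete below is an assumption for illustration: a 2-2-1 network, OR of the two inputs as the exemplars, learning rate 0.5, 5000 passes, and thresholds handled as weights on a constant -1 input as described above.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
NI, NH, NK = 2, 2, 1                 # inputs, hidden nodes, outputs
C = 0.5                              # learning rate

# thresholds as weights: one extra input line with constant -1 (last row)
wij = [[random.uniform(-0.5, 0.5) for _ in range(NH)] for _ in range(NI + 1)]
wjk = [[random.uniform(-0.5, 0.5) for _ in range(NK)] for _ in range(NH + 1)]

# made-up exemplars: learn OR of the two inputs
exemplars = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]

for epoch in range(5000):
    for inputs, O in exemplars:
        I = inputs + [-1.0]          # 1. send in inputs (plus bias input)
        # 2. calculate outputs yk
        yj = [sigmoid(sum(I[i] * wij[i][j] for i in range(NI + 1))) for j in range(NH)]
        yjb = yj + [-1.0]            # bias "hidden output" for the output layer
        yk = [sigmoid(sum(yjb[j] * wjk[j][k] for j in range(NH + 1))) for k in range(NK)]
        # 5.1 and 6.1: calculate all deltas before changing any weights
        dk = [yk[k] * (1 - yk[k]) * (yk[k] - O[k]) for k in range(NK)]
        dj = [yj[j] * (1 - yj[j]) * sum(dk[k] * wjk[j][k] for k in range(NK))
              for j in range(NH)]
        # 5.3: wjk := wjk - C dk yj
        for j in range(NH + 1):
            for k in range(NK):
                wjk[j][k] -= C * dk[k] * yjb[j]
        # 6.3: wij := wij - C dj Ii
        for i in range(NI + 1):
            for j in range(NH):
                wij[i][j] -= C * dj[j] * I[i]

# after training, outputs should be near the correct Ok
E = 0.0
for inputs, O in exemplars:
    I = inputs + [-1.0]
    yj = [sigmoid(sum(I[i] * wij[i][j] for i in range(NI + 1))) for j in range(NH)]
    yjb = yj + [-1.0]
    yk = [sigmoid(sum(yjb[j] * wjk[j][k] for j in range(NH + 1))) for k in range(NK)]
    E += 0.5 * sum((yk[k] - O[k]) ** 2 for k in range(NK))
    print(inputs, [round(y, 2) for y in yk])
```

Note both sets of deltas are calculated from the current weights before any weight is updated, matching the order of the algorithm above.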
  




Summary of how it responds to errors

  1. Note that   yj is positive, and yk ( 1 - yk ) is positive, so if ( yk - Ok ) is positive (i.e. output is too big), then ∂E/∂wjk is positive, and our learning rule reduces the weights. This will reduce the output.

  2. Similarly you can see that if the output yk is too small, the learning rule increases the weights, which will increase the output.

  3. Note that if E = 0, the update rule will cause no change in weights wij or weights wjk. Why?

  4. Note that if Ii = 0, none of the weights wij (for all j) will be changed. Why? Why is this a good thing?

  5. Note that if yj = 0, none of the weights wjk (for all k) will be changed. Why? Why is this a good thing?


