Dr. Mark Humphrys

School of Computing. Dublin City University.



Supervised learning in practice

This section explains Supervised Learning in practice with neural networks and Back-propagation.

We have defined the Back-propagation algorithm. But there is still a lot of work for the human to do in making this work.




Designing the Network

The whole point of learning is that we do not have to design the network by hand.
However, this is only true of the weights and thresholds.
There are still many design decisions, for example how many hidden units to use and how to code the inputs.

  

Limit to what the network can represent

A neural net is an approximation of a non-linear continuous function.
How good that approximation is, or even can be, will be constrained by the architecture of the network.
Backprop can't do everything.

e.g. For the function:

f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
you can't design a network with 1 input, 1 hidden unit, 1 output, and expect backprop to finish the job. There is only so much such a network can represent.

Design is part of the approximation process. Backprop finishes the details.
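
One way to see the limit: a network with 1 input, 1 sigmoid hidden unit and 1 sigmoid output computes a composition of monotonic functions, so its output can only rise or only fall across the input range, whereas f changes direction many times. The following Python sketch checks this numerically. The sigmoid activation, the input range 0 to 2*pi, the random weight range and all helper names are assumptions made for illustration.

```python
import math
import random

def f(x):
    """Target function from the text."""
    return math.sin(x) + math.sin(2 * x) + math.sin(5 * x) + math.cos(x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tiny_net(x, w1, b1, w2, b2):
    """1 input -> 1 sigmoid hidden unit -> 1 sigmoid output."""
    hidden = sigmoid(w1 * x + b1)
    return sigmoid(w2 * hidden + b2)

def direction_changes(values, tolerance=1e-12):
    """Count how often a sampled curve switches between rising and falling."""
    changes = 0
    previous_sign = 0
    for a, b in zip(values, values[1:]):
        diff = b - a
        if abs(diff) <= tolerance:
            continue  # treat as flat (also ignores rounding noise)
        sign = 1 if diff > 0 else -1
        if previous_sign != 0 and sign != previous_sign:
            changes += 1
        previous_sign = sign
    return changes

xs = [2 * math.pi * i / 1000 for i in range(1001)]
target_changes = direction_changes([f(x) for x in xs])

# Try many random weight settings for the 1-1-1 network.
rng = random.Random(0)
net_changes = []
for _ in range(100):
    w1, b1, w2, b2 = (rng.uniform(-5, 5) for _ in range(4))
    curve = [tiny_net(x, w1, b1, w2, b2) for x in xs]
    net_changes.append(direction_changes(curve))

print("direction changes in f:", target_changes)
print("most direction changes in any 1-1-1 net:", max(net_changes))
```

No choice of weights lets this network turn around even once, so backprop has nothing to work with here.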

And design is not easy. It's not simply a matter of having thousands of hidden units. That would tend towards a lookup table with limited generalisation properties. It is an empirical loop:

repeat
  design network architecture
  use backprop to fill in weights

  if too much interference (can't form representation)
    increase number of hidden units

  if too little interference (can't predict new input)
    reduce number of hidden units
  

Designing the inputs

Design of the input node scheme (how many nodes and what values they take) is crucial to making the network work.

The network needs to be able to separate certain areas of the input space from other areas. A lot of work may have to be put into clever coding of the inputs to help the network do this.




Example of designing the inputs

Imagine that the inputs to a network are:
3 inputs each take integer values 0 .. 9
1 input takes integer values 0 .. 8
1 input takes value 0 or 1 
Number of unique inputs = 10 x 10 x 10 x 9 x 2 = 18,000.

Possible input schemes:

  1. 1 input node taking values 1 to 18,000.
  2. 5 input nodes, each taking values 0 to 9.
  3. 15 input nodes, each 0 or 1 (encoding the values 1 to 18,000 in binary).
  4. Some more separation of inputs. Think of the order of 50 to 200 input nodes under some separation scheme.
  5. 18,000 input nodes, each 0 or 1.
Will the input scheme work?
  1. Massive interference. Won't work.
  2. Massive interference. Won't work.
  3. Massive interference. Won't work.
  4. A working input scheme would be somewhere in here.
  5. Works, but is a lookup table. No point to having a neural network.
Usually, without the right coding of the inputs, the network will not work at all.
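
The counts used in schemes 1 and 3 can be checked with a short Python sketch. The function names and the ordering of the five inputs inside the single number are assumptions for illustration.

```python
import math

# Number of values each of the five inputs can take (from the example above).
sizes = [10, 10, 10, 9, 2]

unique_inputs = math.prod(sizes)

# Scheme 3: how many binary nodes are needed to number every unique input.
binary_nodes = math.ceil(math.log2(unique_inputs))

def to_index(values):
    """Scheme 1: collapse the five values into a single number 1..18,000."""
    index = 0
    for value, size in zip(values, sizes):
        index = index * size + value
    return index + 1

def to_binary(values):
    """Scheme 3: the same number written as a list of 0/1 node values."""
    index = to_index(values)
    return [(index >> bit) & 1 for bit in reversed(range(binary_nodes))]

print(unique_inputs, binary_nodes)
print(to_index([0, 0, 0, 0, 0]), to_index([9, 9, 9, 8, 1]))
print(to_binary([3, 7, 0, 8, 1]))
```

Both schemes mix all five variables into one code, so the network would have to untangle them before it could separate areas of the input space.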

One scheme that could work in the "sweet spot" is One-hot encoding (also called 1-of-N or "1-of-C" encoding).
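
A minimal Python sketch of one-hot encoding for the five inputs above, assuming each input simply gets its own group of nodes. This gives 41 nodes, a little under the 50 to 200 range suggested in scheme 4, so a real scheme might separate the inputs further. The function name is hypothetical.

```python
# Number of values each of the five inputs can take (from the example above).
sizes = [10, 10, 10, 9, 2]

def one_hot(values):
    """Give each input its own group of nodes; exactly one node per group is 1."""
    nodes = []
    for value, size in zip(values, sizes):
        group = [0] * size
        group[value] = 1
        nodes.extend(group)
    return nodes

encoded = one_hot([3, 7, 0, 8, 1])
print(len(encoded), sum(encoded))
print(encoded)
```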


Example of separating the inputs:



Forget or remember the exemplars?

It does seem strange that, having been told the correct answer for x, we do not simply return this answer exactly anytime we see x in the future.

Yes, we need a prediction machine that can generate a guess for unseen inputs. But how about inputs we saw before? Why "forget" anything that we once knew? Surely forgetting things is simply a disadvantage.

We could have an extra lookup table on the side, to store the results of every input about which we knew the exact output, and only consult the network for new, unseen input.

The problems with this are:

  1. Computational cost of comparing current input with all past exemplars.
  2. Pointless if current input is almost always different to past exemplars.

  

Example

Consider Input = continuous real numbers in robot senses.
Angle = 250.432 degrees, Angle = 250.441 degrees, etc.
Consider when the input has n dimensions.
To see the same input twice, every dimension needs to be the same, to 3 decimal places. This essentially never happens.

If exact same input never seen twice, our lookup-table grows forever (not finite-size data structure) and is never used. Even if it is (rarely) used, consider computational cost of searching it.
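
A small Python sketch of this effect, assuming 3 sense dimensions, uniformly random angles and 10,000 experiences (all made-up numbers). It stores every input in a lookup table and counts how often an exact repeat turns up.

```python
import random

rng = random.Random(1)
dimensions = 3          # assumed number of sense dimensions
samples = 10_000        # assumed number of experiences

table = {}              # lookup table: exact input -> stored output
hits = 0
for _ in range(samples):
    # Each sense is an angle in degrees, recorded to 3 decimal places.
    senses = tuple(round(rng.uniform(0.0, 360.0), 3) for _ in range(dimensions))
    if senses in table:
        hits += 1             # we have seen this exact input before
    else:
        table[senses] = 0.0   # placeholder for the known correct output

print("table entries:", len(table))
print("exact repeats:", hits)
```

The table gains an entry on almost every step and is almost never consulted successfully.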



Learning to divide up the input space (rather than pre-defined)

The neural network solves the problem of deciding where the input space should be treated as coarse-grained and where as fine-grained.

In the above, if the feedback the neural net gets is the same in the area 240 - 260 degrees, then it will develop weights and thresholds so that any continuous value in this zone generates roughly the same output.
On the other hand, if it receives different feedback in the zone around 245 - 255 degrees than outside that zone, then it will develop weights that lead to a (perhaps steep) threshold being crossed at 245, and one type of output generated, and another threshold being crossed at 255, and another type of output generated.
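
The second case can be sketched in Python with two sigmoid hidden units whose thresholds sit at 245 and 255 degrees. The weights here are set by hand to show the kind of solution backprop could arrive at. They are not learned, and the steepness value and function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

STEEPNESS = 2.0  # assumed weight; larger values give a sharper threshold

def zone_output(angle):
    """Two hidden units, one switching on at 245 and one at 255 degrees."""
    above_245 = sigmoid(STEEPNESS * (angle - 245.0))
    above_255 = sigmoid(STEEPNESS * (angle - 255.0))
    # Output weights +1 and -1: high only between the two thresholds.
    return above_245 - above_255

for angle in (240.0, 244.0, 250.432, 250.441, 256.0, 260.0):
    print(angle, round(zone_output(angle), 4))
```

Inside the zone, nearby angles such as 250.432 and 250.441 give almost the same output, while angles outside the zone give a different one.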

The network can learn to classify any area of the multi-dimensional input space in this way.

Each zone can generate completely different outputs.




Over-learning

Network can start with random values and learn to get rid of these.

But of course that means it can learn to get rid of good values over time as well. It can't tell the difference.

If it doesn't see an exemplar for a while, it will forget it. For all it knows, it has just started learning, and the weights it has now are just a random initialisation! It keeps learning, wiping out anything too far in the past.

Learning = Forgetting!

e.g. Extreme Case - We show it one exemplar repeatedly. e.g. Show it "Input x leads to Output 1", 1 million times in a row. The "laziest" way for the network to represent this is to just send the weights to infinity (or minus infinity for Input negative), so Output = 1 no matter what the Input. i.e. Instead of "x -> 1" it learns "* -> 1"
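
A minimal Python sketch of this extreme case, using a single sigmoid unit with one weight and a bias (the bias plays the role of the threshold). The squared-error gradient, the learning rate of 1.0, the exemplar x = 1 and the function names are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_on_single_exemplar(steps, x=1.0, target=1.0, rate=1.0):
    """Repeatedly show one exemplar "x -> target" to a single sigmoid unit."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        y = sigmoid(w * x + b)
        # Gradient of E = (y - target)^2 / 2 with respect to the unit's net input.
        delta = (y - target) * y * (1.0 - y)
        w -= rate * delta * x
        b -= rate * delta
    return w, b

for steps in (1_000, 100_000):
    w, b = train_on_single_exemplar(steps)
    outputs = [round(sigmoid(w * other + b), 3) for other in (0.0, 0.5, 1.0)]
    print(steps, "steps: w =", round(w, 2), "b =", round(b, 2), "outputs:", outputs)
```

The weight and bias keep growing for as long as the exemplar is shown, and the outputs for inputs 0.0 and 0.5, which the unit was never shown, drift towards 1 as well.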

If we show it "x -> 1" a million times, then all weights may be recruited to help "x -> 1". Normally, if we show it "x -> 1" then it does have an effect on all weights, but this effect is countered by the effects of other exemplars. The way the net resolves this tension is by specialisation, where some weights are more-or-less irrelevant in some areas of the input space. Since they have little effect on the error there, the backprop algorithm ensures they are hardly modified. (If outputs are continuous, the effect is never exactly zero, no matter how tiny.) Then when we show it "x -> 1" once, it does have an effect on such a weight, but the effect is negligible.



How does specialisation happen?

How does the process of specialising work? - As the net learns, it finds that for each weight, the weight has more effect on E for some exemplars than others. It is modified more as a result of the backprop from those exemplars, making it even more influential on them in the future, and making the backprop from other exemplars progressively less important.
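
The difference in influence can be seen in the gradient itself. The following Python sketch computes dE/dw for one sigmoid unit on two exemplars, one where the unit is still sensitive and one where it is saturated. The weight, bias, inputs and targets are made-up values, and squared error is assumed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_gradient(w, b, x, target):
    """dE/dw for one sigmoid unit y = sigmoid(w*x + b), E = (y - target)^2 / 2."""
    y = sigmoid(w * x + b)
    return (y - target) * y * (1.0 - y) * x

w, b = 5.0, 0.0   # assumed current weight and bias of the unit

# Exemplar A lies where the unit is still sensitive; exemplar B where it is saturated.
grad_a = weight_gradient(w, b, x=0.1, target=0.0)
grad_b = weight_gradient(w, b, x=4.0, target=0.0)

print("gradient from A:", grad_a)
print("gradient from B:", grad_b)
print("ratio B/A:", grad_b / grad_a)
```

The gradient from B is many orders of magnitude smaller than the one from A, but it is not zero, so exemplar B barely modifies this weight.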



Strategies for Teaching

First, exemplars should have a broad spread. Show it "x -> y" alright, but if you want it to learn that some things do not lead to y, you must show it explicitly that "(NOT x) -> (NOT y)". e.g. In learning behaviour, some actions lead to good things happening, some to bad things, but most actions lead to nothing happening. If we only show it exemplars where something happened, it will predict that everything leads to something happening, good or bad. We must show it the "noise" as well.

How do we make sure it learns and doesn't forget? - If exemplars come from training set, we just make sure we keep re-showing it old exemplars. But exemplars may come from the world. So it forgets old and rare experiences.

One solution is to have learning rate C decline over time. But then it doesn't learn new experiences.
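
A declining learning rate might look like the following Python sketch. The particular schedule, the starting value and the parameter names are assumptions; the text only says that C declines over time.

```python
def learning_rate(step, initial_rate=0.5, decay_steps=1000):
    """Learning rate C that shrinks as more exemplars are seen.

    The 1 / (1 + step / decay_steps) schedule is an assumption for illustration.
    """
    return initial_rate / (1.0 + step / decay_steps)

for step in (0, 1_000, 10_000, 1_000_000):
    print(step, learning_rate(step))
```

After a million exemplars C is so small that a new experience changes the weights very little, which is the drawback noted above.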

Possible solution if exemplars come from world is internal memory and replay of old experiences. Not remembering exemplars as lookup table, but remembering them to repeatedly push through network. But same question as before - Does the list grow forever?

See example of remembering exemplars and replaying, without needing infinite memory:
A Neural Network learning strategy used in Reinforcement Learning.
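
One simple way to replay old experiences with bounded memory is a fixed-size buffer that drops its oldest entry when full. The Python sketch below is an assumption for illustration and is not necessarily the strategy described at the link above. The class name, capacity and batch size are hypothetical.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of recent exemplars, replayed in random batches."""

    def __init__(self, capacity, seed=0):
        # deque with maxlen silently drops the oldest exemplar when full.
        self.exemplars = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def remember(self, inputs, target):
        self.exemplars.append((inputs, target))

    def replay(self, batch_size):
        """Pick stored exemplars at random to push through the network again."""
        count = min(batch_size, len(self.exemplars))
        return self.rng.sample(list(self.exemplars), count)

memory = ReplayMemory(capacity=1000)
for step in range(5000):
    memory.remember((step * 0.001,), 1.0)   # stand-in exemplars from "the world"

batch = memory.replay(32)
print(len(memory.exemplars), len(batch))
```

The list no longer grows forever, but anything older than the last 1000 exemplars is still lost, so this only softens the forgetting problem.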



Test of neural network - Function approximator

Let us test the neural network with a known function to see how it does.
Consider the function:
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)

The following uses the C++ code for neural network as function approximator.

1 real input, n hidden, 1 real output.
Never sees the same exemplar twice!

The network after having seen 1 million exemplars (top) and 5 million (bottom):


  
The learner of this function was an initially-random neural network with 1 input, 12 hidden units, and 1 output.
It improved over time, but is still not great. Why?

It has difficulty representing f because there are too few hidden units, so it can only form a crude representation.

Remember the network has not increased in size to store this more accurate representation. It still has just 12 hidden units. It has merely adjusted its weights.
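
For readers who want to try a similar test, here is a small Python sketch of a network with 1 input, 12 hidden units and 1 output trained by backprop on fresh random exemplars. It is not the C++ code referred to above. The tanh hidden units, linear output, input range 0 to 2*pi, input scaling, learning rate and number of exemplars (far fewer than the millions used in the text) are all assumptions.

```python
import math
import random

def f(x):
    """Target function from the text."""
    return math.sin(x) + math.sin(2 * x) + math.sin(5 * x) + math.cos(x)

HIDDEN = 12
RATE = 0.01                      # assumed learning rate
rng = random.Random(0)

# Weights: input -> hidden (w, b) and hidden -> output (v, c).
w = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
b = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
v = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
c = 0.0

def forward(x):
    u = x / math.pi - 1.0        # scale x in [0, 2*pi] to [-1, 1]
    h = [math.tanh(w[j] * u + b[j]) for j in range(HIDDEN)]
    return u, h, sum(v[j] * h[j] for j in range(HIDDEN)) + c

def mean_squared_error():
    xs = [2 * math.pi * i / 200 for i in range(201)]
    return sum((forward(x)[2] - f(x)) ** 2 for x in xs) / len(xs)

error_before = mean_squared_error()

for _ in range(30_000):
    x = rng.uniform(0.0, 2 * math.pi)    # a fresh exemplar every time
    u, h, y = forward(x)
    err = y - f(x)
    for j in range(HIDDEN):
        # Back-propagate through the tanh hidden unit before changing v[j].
        delta = err * v[j] * (1.0 - h[j] ** 2)
        v[j] -= RATE * err * h[j]
        w[j] -= RATE * delta * u
        b[j] -= RATE * delta
    c -= RATE * err

error_after = mean_squared_error()
print("error before:", error_before)
print("error after: ", error_after)
```

The network keeps the same 12 hidden units throughout. Only the weights change, so any improvement in the error comes from adjusting them.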

  
