We have defined the Back-propagation algorithm. But there is still a lot of work for the human to do to make it work.
The whole point of learning is that we do not have to design the network. But this is only true of the weights and thresholds: there are still many other design decisions to make.
For example, for the function:

f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)

you can't design a network with 1 input, 1 hidden unit, and 1 output, and expect backprop to finish the job. There is only so much such a network can represent.
Design is part of the approximation process. Backprop finishes the details.
And design is not easy. It's not simply a matter of having thousands of hidden units. That would tend towards a lookup table with limited generalisation properties. It is an empirical loop:
repeat:
    design the network architecture
    use backprop to fill in the weights
    if too much interference (can't form a representation):
        increase the number of hidden units
    if too little interference (can't predict new input):
        reduce the number of hidden units
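A sketch of this loop in C++ follows. It is purely illustrative: the Errors struct, the trainAndEvaluate harness, and the 0.05 tolerances are hypothetical names and values, not part of the course code.

#include <functional>

// Hypothetical results of one experiment: build a net with the given
// number of hidden units, run backprop, report the errors.
struct Errors {
    double train;   // error on the exemplars the net was trained on
    double test;    // error on new, unseen exemplars
};

int tuneHiddenUnits(int hidden,
                    const std::function<Errors(int)>& trainAndEvaluate)
{
    for (;;) {
        Errors e = trainAndEvaluate(hidden);
        if (e.train > 0.05)        // too much interference:
            hidden++;              //   can't even form the representation
        else if (e.test > 0.05)    // too little interference:
            hidden--;              //   memorises, can't predict new input
        else
            return hidden;         // the "sweet spot"
    }
}

In practice the experimenter plays the role of trainAndEvaluate by hand, but the decision rule is the one above.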
The network needs to be able to separate certain areas of the input space from other areas. A lot of work may have to be put into clever coding of the inputs to help the network do this.
For example, suppose:

3 inputs each take integer values 0 .. 9
1 input takes integer values 0 .. 8
1 input takes value 0 or 1
Possible input schemes:
One scheme that could work in the "sweet spot" is One-hot encoding (also called 1-of-N or "1-of-C" encoding).
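For the five raw inputs above, a one-hot coding might look like this (a minimal sketch; laying the 40 network inputs out in this order is one reasonable choice, not the only one):

#include <vector>

// One-hot encode the five raw inputs described above:
//   a, b, c : integers 0..9  -> 10 units each
//   d       : integer  0..8  ->  9 units
//   e       : 0 or 1         ->  1 unit (already effectively one-hot)
// Result: 3*10 + 9 + 1 = 40 network inputs, each 0 or 1.
std::vector<double> encode(int a, int b, int c, int d, int e)
{
    std::vector<double> in(40, 0.0);
    in[a]      = 1.0;    // units  0 .. 9
    in[10 + b] = 1.0;    // units 10 .. 19
    in[20 + c] = 1.0;    // units 20 .. 29
    in[30 + d] = 1.0;    // units 30 .. 38
    in[39]     = e;      // unit  39
    return in;
}

Each raw value then switches on exactly one input line, so the network can separate "first input = 7" from "first input = 8" without having to carve up a single continuous axis.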
Yes, we need a prediction machine that can generate a guess for unseen inputs. But how about inputs we saw before? Why "forget" anything that we once knew? Surely forgetting things is simply a disadvantage.
We could have an extra lookup table on the side, to store the results of every input about which we knew the exact output, and only consult the network for new, unseen input.
The question is whether this is worth it. If the exact same input is never seen twice, our lookup table grows forever (it is not a finite-size data structure) and is never used. Even if it is (rarely) used, consider the computational cost of searching it.
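A sketch of such a hybrid in C++ (the class and its names are hypothetical; the trained network is passed in as a callable):

#include <functional>
#include <map>
#include <utility>
#include <vector>

// Consult an exact-match memory first; fall back to the network's guess.
// Note the costs discussed above: the map grows with every stored
// exemplar, and each query pays a search before the net is consulted.
class HybridPredictor {
    std::map<std::vector<double>, double> seen;             // exact answers
    std::function<double(const std::vector<double>&)> net;  // the generaliser
public:
    explicit HybridPredictor(
        std::function<double(const std::vector<double>&)> network)
        : net(std::move(network)) {}

    void remember(const std::vector<double>& x, double y) { seen[x] = y; }

    double predict(const std::vector<double>& x) const {
        auto it = seen.find(x);
        return (it != seen.end()) ? it->second : net(x);
    }
};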
In the above, if the feedback the neural net gets is the same in the area 240 - 260 degrees, then it will develop weights and thresholds so that any continuous value in this zone generates roughly the same output.

On the other hand, if it receives different feedback in the zone around 245 - 255 degrees than outside that zone, then it will develop weights that lead to a (perhaps steep) threshold being crossed at 245, and one type of output generated, and another threshold being crossed at 255, and another type of output generated.
The network can learn to classify any area of the multi-dimensional input space in this way. This is especially useful for:
The network starts with random values and learns to get rid of these. But that means it can also learn to get rid of good values over time. If it doesn't see an exemplar for a while, it will "forget" it: the weights move off in a different direction. For all the network knows, it has just started learning, and the weights it has now are just a random initialisation.

Learning = Forgetting!
Extreme case: We show it one exemplar "x -> y" repeatedly. It will drift towards predicting y for every input. It also needs to learn that "non x" leads to "non y".
How does the process of specialising work? - As the net learns, it finds that for each weight, the weight has more effect on E for some exemplars than others. It is modified more as a result of the backprop from those exemplars, making it even more influential on them in the future, and making the backprop from other exemplars progressively less important.
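In symbols, this is just the gradient-descent update from the backprop derivation, with learning rate C:

    delta w = - C * dE/dw

A weight for which dE/dw is large on some exemplar gets moved a lot by that exemplar, sharpening its influence there, while exemplars for which dE/dw is small barely move it.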
First, exemplars should have a broad spread. Show it "x -> y" alright, but if you want it to learn that some things do not lead to y, you must show it explicitly that "(NOT x) -> (NOT y)". e.g. In learning behaviour, some actions lead to good things happening, some to bad things, but most actions lead to nothing happening. If we only show it exemplars where something happened, it will predict that everything leads to something happening, good or bad. We must show it the "noise" as well.
How do we make sure it learns and doesn't forget? - If exemplars come from a training set, we just make sure we keep re-showing it old exemplars. But exemplars may come from the world, so it forgets old and rare experiences.

One solution is to have the learning rate C decline over time. But then it stops learning from new experiences.

A possible solution, if exemplars come from the world, is an internal memory and replay of old experiences: not remembering exemplars as a lookup table, but remembering them so they can be repeatedly pushed through the network. But the same question as before arises - does the list grow forever?
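One way to bound the list is a fixed-size replay memory, sketched below (hypothetical names; the fixed capacity answers "does it grow forever?", at the price that very old experiences are eventually dropped):

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Exemplar { std::vector<double> x; double y; };

// Fixed-capacity replay memory: a ring buffer, so it never grows forever.
// Old experiences are eventually overwritten - forgetting is bounded,
// not eliminated.
class ReplayMemory {
    std::vector<Exemplar> buf;
    std::size_t capacity, next = 0;
    std::mt19937 rng{12345};
public:
    explicit ReplayMemory(std::size_t cap) : capacity(cap) { buf.reserve(cap); }

    void add(Exemplar e) {
        if (buf.size() < capacity) buf.push_back(std::move(e));
        else { buf[next] = std::move(e); next = (next + 1) % capacity; }
    }

    // A random remembered experience, to be pushed through backprop again.
    // Assumes at least one experience has been stored.
    const Exemplar& sample() {
        std::uniform_int_distribution<std::size_t> pick(0, buf.size() - 1);
        return buf[pick(rng)];
    }
};

After each real exemplar from the world, the learner would call add() with it and then re-train on a few sample() results, so old and rare experiences keep getting re-shown.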
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
The following uses the C++ code for a neural network as a function approximator.
1 real input, n hidden, 1 real output.
Never sees the same exemplar twice!
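The exemplar stream for such an experiment can be as simple as the following sketch (the input range and the commented-out backprop call are illustrative assumptions, not the actual experimental setup):

#include <cmath>
#include <random>

// The target function from above.
double f(double x)
{
    return std::sin(x) + std::sin(2*x) + std::sin(5*x) + std::cos(x);
}

int main()
{
    const double TWO_PI = 6.283185307179586;
    std::mt19937 rng(12345);
    // Illustrative input range; the real experiment's range may differ.
    std::uniform_real_distribution<double> X(0.0, TWO_PI);

    for (long i = 0; i < 5000000; i++) {
        double x = X(rng);   // a fresh random real every time, so the
        double y = f(x);     // same exemplar is (almost surely) never
                             // seen twice
        // net.backprop(x, y);   // hypothetical call into the C++ net code
        (void)y;
    }
}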
The network after having seen 1 million exemplars (top) and 5 million (bottom):
The learner of this function was an initially-random neural network with 1 input, 12 hidden units, and 1 output.
It has difficulty representing f because there are too few hidden units: it can only form a crude representation.
Remember the network has not increased in size to store this more accurate representation. It still has just 12 hidden units. It has merely adjusted its weights.