So what should we start weights and thresholds at?
Note from the graph of the sigmoid function that large positive or negative summed input x gives a very small slope dy/dx.
dy/dx = y(1-y), and at either end, one of these terms is near zero.
Hence for large absolute xk, the slope yk(1-yk) is near zero, and the weight change is near zero too.
Large absolute summed x (caused by large absolute weights) causes only a small change in the weights per update, i.e. slow learning.
Small weights give fast learning. All things being equal, small weights tend to put us in the middle of the sigmoid curve, the area of rapid change.
Small weights and fast learning is what we want at the start, when we know nothing.
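To make this concrete, here is a small sketch (TypeScript, my addition, not from the notes) that prints the sigmoid and its slope at a few summed inputs:

function sigmoid(x: number): number {
    return 1 / (1 + Math.exp(-x));
}

// Slope dy/dx = y(1-y) at a few summed inputs x.
for (const x of [-10, -5, 0, 5, 10]) {
    const y = sigmoid(x);
    console.log("x = " + x + ", y = " + y.toFixed(5) + ", dy/dx = " + (y * (1 - y)).toFixed(5));
}

The slope peaks at 0.25 when x = 0 and is about 0.00005 at x = plus or minus 10: weight changes out there are some 5000 times smaller.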
So why not just let all weights and thresholds start at zero, the smallest possible values?

Short version of why not:
Multiple identical hidden nodes are useless. You can achieve the same effect with one hidden node and different weights. Consider the following.
Q.
A neural network has
1 input node, n hidden nodes, and 1 output node.
The weights on the input layer are all the same:
wij = W1.
The thresholds of the hidden nodes are all the same:
tj = T.
The weights on the output layer are all the same:
wjk = W2.
This network is equivalent to a network with
1 input node, 1 hidden node, and 1 output node,
where (wij, tj, wjk) =
what?
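You can check your answer numerically. The sketch below is my own illustration, assuming the convention that a node outputs sigmoid(summed input minus threshold), and assuming the answer (W1, T, nW2); try to work it out yourself before reading the code.

function sigmoid(x: number): number {
    return 1 / (1 + Math.exp(-x));
}

// Summed input arriving at the output node of the n-hidden-node network.
// Every hidden node computes the same y = sigmoid(W1*x - T).
function bigNet(x: number, n: number, W1: number, T: number, W2: number): number {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += W2 * sigmoid(W1 * x - T);
    return sum;
}

// Summed input at the output node of the 1-hidden-node network,
// with the claimed equivalent values plugged in.
function smallNet(x: number, n: number, W1: number, T: number, W2: number): number {
    const wij = W1, tj = T, wjk = n * W2;
    return wjk * sigmoid(wij * x - tj);
}

// The two agree for any input x (example values are arbitrary).
for (const x of [-2, 0, 1, 3]) {
    console.log(bigNet(x, 5, 0.7, 0.3, -1.2), smallNet(x, 5, 0.7, 0.3, -1.2));
}

Since all n hidden nodes compute the same y, the output node just sees n copies of W2 times y. This is why identical hidden nodes buy you nothing, and why starting all weights at zero (or any identical value) is a bad idea.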
How small is "small"?
Like many other things to do with neural networks, we may need to experiment.
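For example, a common starting point is uniform random weights in a small range such as -0.1 to 0.1 (random, so that, as argued above, the hidden nodes do not all start identical). A sketch, with the range as the parameter to experiment with:

function randomWeight(range: number): number {
    // Uniform in [-range, range]; range is the "how small" knob to experiment with.
    return (Math.random() * 2 - 1) * range;
}

const weights = Array.from({ length: 6 }, () => randomWeight(0.1));
console.log(weights);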
Note that the sigmoid can only output exactly 0 or 1 with infinitely large summed input, so exemplar outputs of 0 and 1 drive the weights to grow without bound. One way to stop this is for exemplar outputs to be 0.1 to 0.9, rather than 0 to 1.
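A quick way to see why is to invert the sigmoid and ask what summed input a given output needs (again a sketch, my addition):

function logit(y: number): number {
    // Inverse sigmoid: the summed input x for which sigmoid(x) = y.
    return Math.log(y / (1 - y));
}

console.log(logit(0.9));    // about 2.2 - reachable with modest weights
console.log(logit(0.99));   // about 4.6 - needs bigger weights
console.log(logit(1.0));    // Infinity - exact 1 needs infinite weights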