Neural Networks in RL

This page combines Q-learning in Reinforcement Learning with Neural Networks. So you would need to understand both concepts first.

With Q-learning in a small state-space, we would store the Q-values Q(x,a) in a lookup table.

Question - Why does a lookup table approach not work if x or a contain real-number components?

Answer - After some experience our lookup table would look like this:

x               Q(x,a1)         Q(x,a2)         Q(x,a3)

313.4009        0              [0.244]          0
313.7562       [1.355]          0               0
313.7564        0              [0.155]          0

current input x:

313.3170        0               0               0

With real numbered states, every state is new. x has not been visited, so we don't know what to do. Every state x will have precisely one Q-value in precisely one column, since we only visited it once, and tried one action.

NOTE - In fact, technically, a digital computer system will never deal with actual real numbers. There will always be some limit to the no. of decimal places (because there is limit to the no. of bytes the floating-point number is stored in). So technically, a lookup table is possible for any computer/robot system. But could be massive / bigger than memory.

In practice, if large no. of states / large no. of decimal places, get same type of problems as if actually real-valued: Never see same x twice during even a very long experiment, Lookup table huge (if not actually infinite), etc.

So in this situation instead of using a lookup table to represent our state-space in RL, we will use a neural network.
Input = (x,a).
Output = Q(x,a).
Need to transfer info from Q(x,a) to "nearby" (x,a) combinations ("nearby" x's, "nearby" a's).

What is a generalisation (like a neural network) for?

To code for states that haven't been experienced yet.
Give us a bias. Prediction. Extrapolation.
Even in a small discrete space, this is useful.
In a large discrete space, very useful (each individual different state visited very few times).
Need to apply our knowledge of x to nearby state x':
- Simplest method of doing this is perhaps nearest-neighbour. Given input x, find the nearest previously seen input in the n-dimensional space, and return its Q-values as our initial starting values.
- Neural network is more complex - may return new Q-values.
- An issue is then: When you learn, do you update your neighbours as well?
Prediction crucial if actions may be dangerous/costly. e.g. Animals can't just make millions of cost-free random actions. They must extrapolate. "Do the experiments in your head".
So even in a discrete space, you could use a generalisation as a form of cautious exploration policy.
- It should be possible to unlearn any initial bias if the area around the new state turns out to be radically different.
- The bias may be bad, but Q-values all zero, or initially random, is likely to be worse.
To code for continuous spaces (at all). Never see the same state twice. Can't have lookup.
Simply to practically code large discrete spaces. Memory problems:
- Size of memory.
- Time taken to initialise it / traverse it.
The length of time to visit every state multiple times to get good Q-estimate.

Disadvantages of a generalisation

Interference (this is also listed as an advantage).
No convergence proof (Q-learning convergence proof is only valid for lookup tables).
May converge to local optimum.

Designing the Network

See Designing the Network

In particular, our choice of how to separate the input nodes will decide how effective our network can be in learning the Q-values.

Might seem strange that we have to make these design decisions to get an effective network, but remember that our coarse-grained discrete lookup table is a generalisation itself. It is already a hand-designed initial approximation, deciding to separate directions into 8 compass points.

Designing the Inputs

Consider where x = (f,n,p,i), where f takes values 0..9, n and p also 0..9, i takes values 0,1, and a = 0..8.

No. of possible (x,a) combinations = 18,000.

We could have 1 input node which takes values 1 to 18,000.
- Massive interference.
You might be able to separate the inputs if they were grouped together. e.g. The net could separate Inputs 15000 and greater (i.e. 15000 - 18000) by having a hidden node with weight 1/15000 and threshold 1.

Q. How would you separate, say, the inputs between 15,000 and 16,000?

The problem is, the inputs you want to group together are likely to be scattered all over the 1 - 18000 range.
e.g. Separate (*,*,9,*,*) from the rest
Separate (*,9,0-7,*,*) from the rest

Interference

Remember that when you learn in a neural net, you backprop the error and change all weights, thus interfering with the output for all other inputs. We adjust the weights to make the net better for x. That may make it worse for x'. If inputs lead to different outputs, we need to learn to separate them.

Only when weights or inputs are zero along some connection is there absolutely no interference. Zero weights/inputs are our main trick in helping the net separate the inputs.

Say hidden node has threshold t_j = 50, and only connected to 1 input, with w_ij = 1 (i.e. weight zero on other links). Then it only fires for the region where input > 50.
If w_ij = - 1, threshold t_j = -50, it only fires for input < 50.
In this way different parts of the net fire for different regions of n-dimensional input space.

But we must give it enough links and enough hidden nodes so it can specialise.

Designing the Inputs (continued)

We could encode the "ID" of the (x,a) pair, on a scale from 1 to 18,000, in binary. Then we could input it through 15 input nodes, each one binary 0 or 1.
- 0 or 1 is good - it helps the next layer separate different inputs. But depending on our ID encoding scheme here, similar and dissimilar input might be jumbled up. Better solution is to just do some grouping of similar input into similar input nodes.
We could have 18,000 input nodes with values 0 or 1.
- Excellent separation, but might as well have a lookup table.
Or 5 input nodes taking values 0..9.

Input scheme - Conclusion

Noting that each component of the input takes only a small number of discrete values, have 10 + 10 + 10 + 2 + 9 = 41 input nodes, each taking values 0 or 1.
(Every input will have precisely 5 1's and 36 0's)

This is called One-hot encoding (also called 1-of-N or "1-of-C" encoding).

We can rank all our schemes, from most interference to least (lookup table):

1 input node taking values 1 to 18,000
5 input nodes, each taking values 0 to 9
15 input nodes, each 0 or 1 (encoding the values 1 to 18,000 in binary)
41 input nodes, each 0 or 1
18,000 input nodes, each 0 or 1

The net didn't work at all for the first 3. The last is basically a lookup table. The net only started to work when we increased the separation to the level of the 4th one above.

One net per action

In fact, to separate the inputs further, because we have only a small number of discrete actions, we could have one net per action. We have 9 separate nets. Each takes a vector input x and produces a floating point output:
Q_a(x) = Q(x,a)
Now each net has 10 + 10 + 10 + 2 = 32 input nodes, each 0 or 1.

Output

Q-value = x_k - t_k
(No need to normalise it with sigmoid. Goes from - infinity to infinity.)