Coding the state-space as a lookup-table

How do we actually code the state-space? In particular how do we encode the Q(x,a) data structure? We want a data structure that can take as input (i.e. be indexed by) two vectors x and a, both members of finite sets, and return a real number output Q(x,a).

The basic idea is that because x and a are members of finite sets, we can give each of them a unique ID number, and so we can give the (x,a) combination a unique ID too. Then we just have an array of some size N of real numbers, indexed by this ID.

x = (x₁, x₂, .. x_n)
where each x_i takes one of some finite set of values v₀, v₁, ... v_m (in fact, the set of values, and the number of them, could be different for each dimension i).
e.g. x = (x₁, .. x₁₀) where for example x₇ takes values -100, 0, or 100.

To help us enumerate the states so we can construct an ID number, we simply constrain the set of values to run from 0 to m (we can translate to/from the real values when we use them). So above we say x₇ takes values 0,1,2.

Lookup tables and Generalisations

If x and a were not members of finite sets we would need to use a generalisation like a neural network.

Even with large finite, we may need generalisation.

To reduce the statespace size, we may need to be careful with our definition of state. Note that we can easily define too many possibilities, so that a huge chunk of the logical state space may not actually exist, e.g. all blank in chess board, or all kings. Could even have 90 percent of the lookup table unused.

Often we might make the world discrete and finite by dividing up the continuous input into a small number of coarse-grained divisions. But remember this in itself is a generalisation.

Example

Consider how we could enumerate the finite set of states defined as follows:

x = (x₁, x₂, x₃)
where x₁ takes values 0,1,
x₂ takes values 0,1,2,
and x₃ takes values 0,1.
We can sum this up by just having a set of variables saying how many values each dimension takes:
c₁=2, c₂=3, c₃=2.

We can enumerate as follows:

x = (0,0,0)  id = 0
     0 0 1        1
     0 1 0        2
     0 1 1        3
     0 2 0        4
     0 2 1        5
     1 0 0        6
     1 0 1        7
     1 1 0        8
     1 1 1        9
     1 2 0        10
     1 2 1        11

Note that it is not binary.

Q. When is it binary?

Consider the last state (1,2,1).
The "1" means we have done the "0"'s in the first column, i.e. we have done 1.c₂.c₃,
plus the "2" means that within the "1"s we have already done an additional 2.c₃,
plus the last "1" means we have done an additional 1.
So id(1,2,1) = 1.c₂.c₃ + 2.c₃ + 1

In general, we can enumerate a state as follows:

int id ( state x )
{
return x₁c₂c₃ + x₂c₃ + x₃
}

Enumeration scheme

States

Where x = (x₁, x₂, .. x_n),
id(x) = x₁c₂..c_n + x₂c₃..c_n + ... + x_(n-1)c_n + x_n

The number of possible states is:
nostates = c₁c₂...c_n

The state IDs run from: 0 .. (nostates-1)

Q. The first state in the enumeration is:
( 0, 0, ..., 0 )
Show using the rule above that its ID is:
0

Q. The last state in the enumeration is:
( c₁-1, c₂-1, ..., c_n-1 )
Show using the rule above that its ID is:
(c₁c₂...c_n) - 1

Exercise - Write the state2id and the id2state functions.

Actions

Similar scheme. Action IDs run from: 0 .. (noactions-1)

States x Actions

(x,a) pairs can be enumerated:

(0,0)                              id = 0 + 0
(0,1)                                   0 + 1
...                                      ...
(0,noactions-1)                         0 + noactions-1
(1,0)                         1.noactions + 0
...                                      ...
(1,noactions-1)               1.noactions + noactions-1
(2,0)                         2.noactions + 0
...                                      ...
(nostates-1,noactions-1)

The ID of an (x,a) pair is:
(id(x) * noactions) + id(a)

The number of (x,a) combinations is:
nostates*noactions

IDs run from:
0 .. (nostates*noactions)-1

Pseudo-code

So our function to access the State Space is:

real StateSpace :: at ( vector x, vector a )
{
int id = (id(x) * noactions) + id(a)
return ActualArray [ id ]
}

where ActualArray is simply an ordinary 1-dimensional array of real numbers:

real ActualArray [ nostates * noactions ]

Then we have one of these data structures to hold the Q-values:

StateSpace Q

and we can just use it like:

Q.at(x,a) = ((1-ALPHA) * Q.at(x,a)) + (ALPHA * new-estimate)

In the sample code you will see that I enumerated Actions x States rather than States x Actions as above.

This is basically a 1-1 (no collisions) hash function.
Support for Multidimensional arrays may help, if it exists in your language.