Dr. Mark Humphrys

School of Computing. Dublin City University.


The Control Policy

In real life we cannot visit each (x,a) an infinite number of times and then exploit our knowledge, so we can only approximate Q-learning. We could do a large but finite amount of random exploration and then exploit what we have learnt. This works acceptably on small statespaces.

On Q-learning problems with larger statespaces, random exploration takes too long to focus on the best actions, and when combined with generalisation it also introduces a huge amount of noise.
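To make this concrete, here is a minimal sketch (not from the original notes) of "random exploration for a finite budget, then exploit". The environment interface env.reset(), env.step(a) and env.actions(x), and the constants alpha and gamma, are hypothetical placeholders; the notes also keep a separate learning rate per (x,a), which this sketch simplifies to a single alpha.

import random

# Sketch: Q-learning with purely random exploration for a fixed number of steps,
# followed by pure exploitation of the learned Q-values.
# env.reset(), env.step(a) -> (next_state, reward), env.actions(x) are assumed.

def q_learn_random_then_exploit(env, explore_steps=100_000, alpha=0.1, gamma=0.9):
    Q = {}                                    # Q[(x, a)] -> current estimate
    x = env.reset()
    for t in range(explore_steps):
        a = random.choice(env.actions(x))     # purely random exploration
        x2, r = env.step(a)
        best_next = max(Q.get((x2, b), 0.0) for b in env.actions(x2))
        Q[(x, a)] = (1 - alpha) * Q.get((x, a), 0.0) + alpha * (r + gamma * best_next)
        x = x2

    def policy(x):                            # exploit: pick argmax_a Q(x,a)
        return max(env.actions(x), key=lambda a: Q.get((x, a), 0.0))

    return Q, policy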



Probabilistic world needs multiple samples

Consider - In state x, we have a choice of 50 actions. We pick 1: useless.
Eventually, 10,000 timesteps later, we're in x again and pick another action: useless.
Eventually we are in x again and by chance we pick what will turn out to be the optimal action a*. Normally this gives reward r with probability 0.8. Unfortunately this is one of the p = 0.2 times, so it returns 0. We can't tell the difference between it and the other actions.
Even worse, the next time we are in x we try a worse action b which normally (p = 0.9) gives reward 0, but just this once returns r = 1. So right now we think b is the best.
We have to try b multiple times in x to realise how bad it really is, and we have to try a* multiple times in x to see its true probabilities.
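A tiny simulation (using the probabilities assumed in the text above) makes the point: one sample of each action can easily rank a* and b the wrong way round, while averages over many visits recover the true ordering.

import random

# a* pays r = 1 with probability 0.8; b pays r = 1 with probability only 0.1.

def sample(p):
    return 1 if random.random() < p else 0    # one stochastic reward

print("one sample of a*:", sample(0.8))       # could be 0 (the unlucky 20% case)
print("one sample of b :", sample(0.1))       # could be 1 (the lucky 10% case)

n = 10_000
print("average of a* over many visits:", sum(sample(0.8) for _ in range(n)) / n)
print("average of b  over many visits:", sum(sample(0.1) for _ in range(n)) / n)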



Random learning

Random learning: Take random action in each state.
Do random learning for finite time, then exploit.

Problem - We may only ever visit boring, mainstream states - only a small part of the entire statespace.
The interesting areas of the statespace are hard to visit: a skilled sequence of actions is needed even to get into those states, but they carry high r's.



Non-random exploration will be a heuristic

When we can't try all a in all x repeatedly in the finite time available:

We need a non-random exploration policy that does "smart" exploration in the finite time available: it gets focused on a small number of actions quickly, and becomes increasingly narrow-minded. We need a heuristic rule - some rule for focusing on promising actions and states.

Note however that if we have any policy other than try all a's endlessly in all x's, we may be unlucky and miss the optimal actions. A heuristic is "a rule that may not work" (could be unlucky). In a large state-space we may have no choice and have to take that chance.




Boltzmann "soft max" probability distribution

Here is a nice heuristic for non-random exploration.

Given a state x, the agent tries out action a with probability:

$$ P(a \mid x) \;=\; \frac{e^{Q(x,a)/T}}{\sum_{b} e^{Q(x,b)/T}} $$

where T > 0 is a "temperature" parameter and the sum runs over all actions b available in x.

This is the Boltzmann or "soft max" distribution.
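A minimal sketch of this in code (the function names are mine, not from the notes). The maximum Q-value is subtracted before exponentiating purely for numerical stability; it does not change the probabilities.

import math
import random

def boltzmann_probs(q_values, T):
    """Boltzmann distribution over actions: q_values = [Q(x,a) for each a], T > 0."""
    m = max(q_values)
    exps = [math.exp((q - m) / T) for q in q_values]   # e^(Q(x,a)/T), shifted by max
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(q_values, T):
    """Sample an action index with probability proportional to e^(Q(x,a)/T)."""
    return random.choices(range(len(q_values)), weights=boltzmann_probs(q_values, T))[0]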




Boltzmann "soft max" distribution

[Figure: the Boltzmann "soft max" distribution.]

Q-learning with Boltzmann "soft max"

Instead of random actions, we tend to pick actions that have generated rewards before. As time goes on we only focus on one or two actions in each state.

The action with the highest Q-value will always be strictly the most probable.

The temperature T controls the amount of exploration (the probability of executing actions other than the one with the highest Q-value). If T is high, or if Q-values are all the same, this will pick a random action. If T is low and Q-values are different, it will tend to pick the action with the highest Q-value.

The idea is to start with high exploration and decrease it to nothing as time goes on, so that after a while we are only exploring (x,a)'s that have worked out at least moderately well before.

At the start, Q is assumed to be totally inaccurate, so T is high (high exploration) and all actions have a roughly equal chance of being executed. T decreases as time goes on, and it becomes more and more likely to pick among the actions with the higher Q-values, until finally, as we assume Q is converging to $Q^*$, T approaches zero (pure exploitation) and we tend to pick only the action with the highest Q-value:

$$ P(a \mid x) \;\to\; \begin{cases} 1 & \text{if } a = \arg\max_{b} Q(x,b) \\ 0 & \text{otherwise} \end{cases} \qquad \text{as } T \to 0 $$

That is, the agent acts according to a stochastic policy which as time goes on becomes closer and closer to a deterministic policy.

The exception is if there are joint winners, in which case the final policy will still be stochastic.
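One possible temperature schedule, as a sketch (the notes only say that T starts high and approaches zero; the exponential decay and the constants here are assumptions):

# As T falls, the Boltzmann policy moves from near-random towards near-greedy.

T_START, T_END, DECAY = 10.0, 0.01, 0.999

def temperature(step):
    """Temperature after `step` learning steps, decayed towards a small floor."""
    return max(T_END, T_START * (DECAY ** step))

for step in (0, 1_000, 5_000, 20_000):
    print(step, round(temperature(step), 4))     # 10.0 -> ... -> floor of 0.01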




Example

Say Q-values start at zero:

a        a1     a2     a3     a4     a5     a6
Q(x,a)    0      0      0      0      0      0

After some time, they are as follows (the actions still on 0 haven't been tried yet):

a        a1     a2     a3     a4     a5     a6
Q(x,a)    0    3.1   -3.3     0   -3.5    3.3

Now a high temperature would still pick actions essentially at random:

a        a1     a2     a3     a4     a5     a6
p(a)    1/6    1/6    1/6    1/6    1/6    1/6

A moderate temperature would pick actions with probabilities something like:

a        a1     a2     a3     a4     a5     a6
p(a)    0.2    0.3      0    0.2      0    0.3

A low temperature would pick actions with probabilities something like:

a        a1     a2     a3     a4     a5     a6
p(a)      0    0.1      0      0      0    0.9
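The probabilities in these tables are illustrative rather than exact. Plugging the Q-values above into the Boltzmann formula at a few temperatures (the temperatures chosen here are assumptions) gives qualitatively the same picture:

import math

Q = [0, 3.1, -3.3, 0, -3.5, 3.3]          # Q(x,a1) ... Q(x,a6) from the example

def boltzmann_probs(q, T):
    m = max(q)
    exps = [math.exp((v - m) / T) for v in q]
    s = sum(exps)
    return [round(e / s, 2) for e in exps]

for T in (1000.0, 5.0, 0.1):              # high, moderate, low temperature
    print("T =", T, boltzmann_probs(Q, T))
# High T: all actions near 1/6.  Low T: nearly all mass on a6, a little on a2.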

Q. What other heuristics could we use, other than a heuristic based on the Q-values?



Tried v. Untried actions

A good question is how our policy deals with actions that haven't been tried yet. Q-values may start at any values, and 0 is not necessarily neutral. 0 isn't necessarily small (or large); it might be a high or a low Q-value.


  1. If we initialise all Q-values to Qmax, will every action get tried at least once?
    Only if we have time. If not, we could finish after finite time with the "best" actions being all the actions we have never tried!
    i.e. After learning, our policy of "pick the best action" yields a more-or-less random policy!

    Could change it to distinguish between tried and untried actions:
    e.g. After learning, the policy is: "Take the best Q-value of an action that has been tried."


  2. If we initialise all Q-values to Qmin, then some actions might get dropped without ever having been taken.

    Could change it to distinguish between tried and untried actions:
    Be narrow-minded about a's we have tried, random about a's we haven't tried?
    Toss a coin and either pick a random untried action, or else pick from the tried actions with Boltzmann (see the sketch after the note below).


Note that we do know whether an action has been tried, because we are keeping a separate learning rate α for each pair (x,a).
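Here is a sketch of that modified policy (the coin-toss probability p_untried and the function shape are assumptions, not from the notes). The counts n[(x,a)] record how often each pair has been tried, which we have anyway since we keep a separate α per pair.

import math
import random

def pick_action(x, actions, Q, n, T, p_untried=0.5):
    """Pick a random untried action with probability p_untried,
    otherwise pick among the tried actions with the Boltzmann distribution."""
    untried = [a for a in actions if n.get((x, a), 0) == 0]
    tried   = [a for a in actions if n.get((x, a), 0) > 0]
    if untried and (not tried or random.random() < p_untried):
        return random.choice(untried)                  # explore an untried action
    q = [Q[(x, a)] for a in tried]
    m = max(q)
    weights = [math.exp((v - m) / T) for v in q]       # Boltzmann over tried actions
    return random.choices(tried, weights=weights)[0]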


Q(x,a) = 0 isn't necessarily a neutral initial value

Starting Q-values at 0 isn't necessarily neutral. e.g. All r's in the system are negative.
But even if we have a "normal" reward function:
if (good thing) r = 1
else if (bad thing) r = -1
else r = 0
then 0 is a neutral reward, but that doesn't mean it's a neutral Q-value.
e.g. If (bad thing) happens all the time, no matter what you do, and if the best an optimal policy can manage is a few fewer (bad thing)'s than other policies, plus a very rare (good thing), then the best Q-value might still be negative.
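A quick worked illustration (the percentages and the discount factor are assumptions): suppose even the optimal policy still hits (bad thing) on 90% of steps and (good thing) on 10%, with discount factor gamma = 0.9. Then the expected discounted return is E[r] / (1 - gamma) = (0.9*(-1) + 0.1*(+1)) / 0.1 = -8, so every Q-value, including the best one, sits well below the "neutral" starting value of 0.

gamma = 0.9
expected_r = 0.9 * (-1) + 0.1 * (+1)          # expected reward per step = -0.8
print(expected_r / (1 - gamma))               # about -8.0: the best Q-value is still negative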

When finite learning is finished

After finite learning, we may have limited knowledge of some actions:

a        a1     a2     a3     a4     a5
Q(x,a)   -5      0     -1     -1     -5
n(x,a)   10      0     10      1      1

What have we learnt? Which action should we take?
Is there any elegant way of combining n and Q?




Animals v. Computers

Deterministic control policy - finally settles on one action with p=1, others p=0.
- "Computers"
- Does the same thing every time
- Brittle
- Infinite loops
- Crashes

Stochastic control policy - still has non-zero probabilities at run-time.
- "Animals"
- Usually, but not totally, predictable
- Doesn't "crash" or get stuck in an infinite loop
- Will eventually do something different

Admittedly, animals not getting stuck in infinite loops is also to do with: the world changes (no absorbing states), internal state changes (hunger), ongoing competition among multiple drives, etc.






Infinite loops in nature




An Ant mill - an infinite loop in nature.




Absorbing states

An absorbing state x is one for which, for all actions a:

$$ P(x_{t+1} = x \mid x_t = x, \; a_t = a) = 1 $$

Once in x, the agent can't leave. The problem with this is that it will stop us visiting all (x,a) an infinite number of times. Learning stops for all other states once the absorbing state is entered, unless there is some sort of artificial reset of the experiment.

In a real environment, where x contains information from the senses, it is hard to see how there could be such a thing as an absorbing state anyway, since the external world will be constantly changing irrespective of what the agent does. It is hard to see how the world could be relied on to supply the same sensory data forever.

An artificial world could have absorbing states though.
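A minimal artificial example (the states, actions and transition table are invented for illustration): state 2 below is absorbing, since every action maps it back to itself, so once the agent enters it, no further experience of states 0 and 1 is possible without a reset.

# Deterministic transition table: P[(state, action)] = next state.
P = {
    (0, 'left'): 0, (0, 'right'): 1,
    (1, 'left'): 0, (1, 'right'): 2,
    (2, 'left'): 2, (2, 'right'): 2,    # absorbing: all actions stay in state 2
}

def is_absorbing(x, actions=('left', 'right')):
    return all(P[(x, a)] == x for a in actions)

print([x for x in (0, 1, 2) if is_absorbing(x)])    # -> [2]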


