Convergence

In the discrete case, we store each Q(x,a) explicitly, and update:

for some learning rate which controls how much weight we give to the reward just experienced, as opposed to the old Q-estimate. We typically start with , i.e. give full weight to the new experience. As decreases, the Q-value is building up an average of all experiences, and the odd new unusual experience won't disturb the established Q-value much. As time goes to infinity, , which would mean no learning at all, with the Q-value fixed.

Remember that because in the short term we have inaccurate Q(y,b) estimates, and because (more importantly) in the long term we have probabilistic transitions anyway, the return from the same (x,a) will vary.

We use a different for each pair (x,a), depending on how often we have visited state x and tried action a there. When we are in a new area of state-space (i.e. we have poor knowledge), the learning rate is high. We will have taken very few samples, so the average of them will be highly influenced by the current sample. When we are in a well-visited area of state-space, the learning rate is low. We have taken lots of samples, so the single current sample can't make much difference to the average of all of them (the Q-value).

Probably true in all learning that you may spend 90 percent of your time going over the same 10 percent of states.

Convergence conditions

[Watkins and Dayan, 1992] proved that the discrete case of Q-learning will converge to an optimal policy under the following conditions.

The learning rate , where , should take decreasing (with each update) successive values , such that and .

The typical scheme is, where is the number of times Q(x,a) has been visited:

If each pair (x,a) is visited an infinite number of times, then Watkins and Dayan showed that for lookup tables Q-learning converges to a unique set of values which define a stationary deterministic optimal policy.

Notes on α

The learning rate α,
where

,
should take decreasing (with each update) successive values

,
such that

and

Notes:

The harmonic series 1 + 1/2 + 1/3 + ... is an infinite sum.
Proof: Oresme
The infinite series: 1 + (1/2)^p + (1/3)^p + ... is a finite sum if p > 1 (and infinite for p < = 1).
Proof: The sum is equal to:
1 + ( (1/2)^p+(1/3)^p ) + ( (1/4)^p+(1/5)^p+(1/6)^p+(1/7)^p ) + ...
< 1 + ( (1/2)^p+(1/2)^p ) + ( (1/4)^p+(1/4)^p+(1/4)^p+(1/4)^p ) + ...
= 1 + 2 (1/2)^p + 4 (1/4)^p + ...
= 1 + r + r² + r³ + ...
where r = (1/2)^p-1
p-1 > 0, i.e. r < 1
and the geometric series 1 + r + r² + r³ + ... with 0 < r < 1 is a finite sum. Proof.

Consider:

= 1, (1/2)^p, (1/3)^p, ...
Need: 1 + (1/2)^p + (1/3)^p + .... to be infinite,
hence: p < = 1
Need: 1 + (1/2)^2p + (1/3)^2p + .... to be finite,
hence: 2p > 1
hence:

will satisfy the conditions for any

Map of α(x,a)

Consider map of α(x,a)

At start, whole map is 1.
After a time, unvisited (x,a) pairs still have α(x,a) = 1.
Visited ones are down to 1/n (which means visited how many times?).
Unvisited states x have α(x,b) = 1 for all b.

The optimal policy

Q-learning is asynchronous and sampled - each Q(x,a) is updated one at a time, and the control policy may visit them in any order, so long as it visits them an infinite number of times. Watkins describes his algorithm as "incremental dynamic programming by a Monte Carlo method".

After convergence, the agent then will maximize its total discounted expected reward if it always takes the action with the highest -value. That is, the optimal policy is defined by where:

Then:

may be non-unique, but the value is unique.

Initial Q-values can be random. typically starts at 0, wiping out the initial Q-value in the first update. Although the term may factor in an inaccurate initial Q-value from elsewhere, these Q-values will themselves be wiped out by their own first update, and poor initial samples have no effect if in the long run samples are accurate (Theorem A.2 later).

Optimal policy found before convergence

Each Q-value Q(x,a) converges to Q^*(x,a)
where the optimum action to take in x is the action a^* s.t.
Q^*(x,a^*) = max_b Q^*(x,b)

But what normally happens is:
Long before the Q-values finalise, we will have fixated on the best action. Often surprisingly early, we find that:
max_b Q(x,b) = Q(x,a^*)
(note Q, not Q^*) - that is, the leader now will still be the leader when the Q-values converge - the leader now is actually a^* though we don't know it yet.

In practice, the competition rapidly settles down into a contest between 1 or 2 actions.

We usually find that a greedy policy (execute the action with the highest Q-value) is an optimal policy long before the Q-values have converged to their final values.

RL course
- Convergence of Q-learning

Need infinite time?

"Infinite no. of samples" seems like a very weak convergence result. But it is only demanded because in a probabilistic world there is always a non-zero probability of any pattern of unlucky samples - e.g. Could get 1000 heads in a row, even though the coin is not only not biased towards heads, but in fact is biased towards tails. In reality, we would be very unlucky not to be operating according to an optimal policy quite quickly.

Q. If coin is biased 75-25 towards tails, what is prob. of 1000 heads in a row?

Q. If you did one 1000-toss experiment a day, how long would you wait to see a single 1000-head event?

p(policy is optimal)

Another way of looking at it:

As we learn, we have a policy. Eventually this policy is optimal. We can definitely say:

As time goes to infinity, p(policy is optimal) -> 1.

In practice, need to be very unlucky not to have optimal policy after short finite time.

e.g. After short finite time, p(policy is optimal) = 0.99999.

A perverse world

You could deliberately design a difficult world, in which learning the optimal policy took a long time.
Example: P_xa(y) = 0.01, P_yb(z) = 0.01.
No other way of getting to z other than taking b in y (even though most times it won't lead to z).
No other way of getting to y other than taking a in x (even though most times it won't lead to y).
z then has one action c with high rewards.

Consider:

z has only non-zero reward in system.
In finite time: Will learn random policy. Never get into z. Never found any rewards anywhere.
There are other rewards in the system, but z has a very high reward so that optimal policy is to go to z.
Needs to be very high reward in z to make up for the low probability of getting it (high probability of going to other, worthless states if you take b in y).

Q(y,b) = 0 + γ E(V(next state))

V(z) is very high, but y,b only leads there with prob. 0.01.
V(state) is very low for the states that y,b leads to with combined prob. 0.99.
(0.99) low + (0.01) high
can still be high if "high" is high enough.
Hence Q(y,b) can be high if the r for c is high enough.
Optimal policy is to take actions a in x, b in y, c in z.
But will take a long time to learn this. Because it's hard to get to y at all. And even harder to get to z at all. And you have to get to both many times in order to find b and c.
In finite time: Will learn sub-optimal policy of maximising the other rewards. Will think we've done well. Never find out about higher rewards available at z.

Can you learn optimal policy in finite n steps?

It depends what n is.

Given that a real-world experiment will end (has to end) after n timesteps, for some finite n, it is easy to design a MDP such that it is highly unlikely that you will learn an optimal policy in n steps. Can do this no matter what n is, so long as it is finite.

How do we know your problem world is not like this? You just have to hope your problem world is not like this. You just have to hope your n is big enough.

Convergence proof may be of limited use

Convergence proof is nice, but may be of limited use.

Q-learning is attractive because of the simplicity of the computational demands per timestep, and because of this proof of convergence to a global optimum, avoiding all local optima.

World may not be MDP:
One catch though is that the world has to be a Markov Decision Process. Many interesting worlds are liable to break the MDP model.
Convergence only works with lookup tables:
Another catch is that convergence is only guaranteed when using lookup tables, while the statespace may be so large as to make discrete lookup tables, with one cell for every combination of state and action, require impractically large amounts of memory. In large statespaces one will want to use some form of generalisation, like a neural network, but then the convergence proof no longer holds.
Maximising r's may not solve the problem:
Finally, even if the world can be approximated by an MDP, and our generalisation can reasonably approximate lookup tables, the policy that Q-learning finds may be surprising. The optimal solution means maximising the rewards, which may or may not actually solve the problem the designer of the rewards had in mind. The agent may find a policy that maximizes the rewards in unexpected and unwelcome ways. Still, the promise of RL is that designing reward functions will be easier than designing behavior.
See discussion of designing reward functions here and here.

Maximising r's may not solve the problem:
My agent in this 1996 demo maximises rewards by jumping in and out of the "nest".

Maximising r's may not solve the problem:
An undergrad student of mine, Eoin Murphy, uses RL to play Super Mario in 2016.
His agent maximises r's by jumping up and down. (Starts at 1:08.)

Evolving neural network to play Super Mario: "MarI/O" program

Dynamic Programming - "For all x,a, try a in x"

Recall Dynamic Programming.
In DP - Don't actually visit any state. Don't actually take any action. Can do it all in your head. What would happen if we took action a in x.

RL - Have to actually be in state x and have to actually execute action a to find out what happens.
Problem - Can't systematically say "for all x,a, go to x, take a" because don't know how to get to x. That's the whole problem.
If we knew how to get to any x, no need to learn anything! Just say "start chess game. go to checkmate" and you're done!

DP doesn't know how to get to x either, but it doesn't have to actually get there. RL does have to actually get to x.

Use DP or RL?

If we know P_xa(y), then DP can construct the optimal policy without interacting with the world at all.

If we don't know P_xa(y), then Q-learning can repeatedly sample the world to construct an optimal policy, if it is a Markov world.

But of course if we don't know P_xa(y), how do we know it's a Markov world!

Paradox

If we know it's an MDP, we don't need Q-learning!

If we don't know it's an MDP, we don't know if Q-learning works!

In practice, no one ever applies Q-learning to an MDP!

Q-learning is only ever applied to unknown worlds, that we hope can be approximated by an MDP (i.e. applying Q-learning will produce an alright policy).

Kaelbling survey
- non-Markov worlds