The figure illustrates how Q-learning deals with delayed reinforcement: action a leads from x to y, action b leads from y to z, and action c leads from z to the waterhole and a reward. Let all Q-values start at 0, with r > 0 and γ > 0.
How Q-learning works (adapted from a talk by Watkins).
Imagine that rewards are zero everywhere, but from state z the agent can take an action c that yields a high reward (Watkins used the example of an animal arriving at a waterhole). Then Q(z,c) will be scored high, and so V(z) increases. Q(y,b) does not immediately increase, but next time it is updated, the increased V(z) is factored in, discounted by γ. Q(y,b) and hence V(y) increase - y is now valuable because from there you can get to z. Q(x,a) does not immediately increase, but next time it is updated, the increased V(y) is factored in, discounted by γ (hence V(z) is effectively factored in, discounted by γ^2). Q(x,a) increases because it led to y, and from there you can get to z, and from there you can get a high reward. In this way the rewards are slowly propagated back through time.
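To see this propagation mechanically, here is a minimal sketch, not from the original notes, of tabular Q-learning on this three-state chain; the values γ = 0.9, r = 1 and learning rate α = 1 are assumptions for illustration.

# Minimal sketch (not from the notes): tabular Q-learning on the chain
# x --a--> y --b--> z --c--> waterhole, with the only reward r at the waterhole.
# gamma = 0.9, r = 1 and alpha = 1 are assumed values for illustration.

gamma = 0.9
r = 1.0

# One Q-value per (state, action) on the chain; all start at 0.
Q = {('x', 'a'): 0.0, ('y', 'b'): 0.0, ('z', 'c'): 0.0}

# Deterministic model: (state, action) -> (reward, next state). 'end' is absorbing.
model = {('x', 'a'): (0.0, 'y'),
         ('y', 'b'): (0.0, 'z'),
         ('z', 'c'): (r,   'end')}

def best_q(state):
    """max over actions of Q(state, action); 0 for the absorbing end state."""
    values = [q for (s, _), q in Q.items() if s == state]
    return max(values) if values else 0.0

# Sweep the chain three times, updating each Q-value from its successor.
for sweep in range(1, 4):
    for (state, action) in [('x', 'a'), ('y', 'b'), ('z', 'c')]:
        reward, next_state = model[(state, action)]
        Q[(state, action)] = reward + gamma * best_q(next_state)   # alpha = 1
    print(sweep, Q)

# Sweep 1: only Q(z,c) becomes r.
# Sweep 2: Q(y,b) becomes gamma * r.
# Sweep 3: Q(x,a) becomes gamma^2 * r - the reward propagates back one step per sweep.

Each sweep pulls the reward one step further back along the chain, exactly as described above.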
Spelling it out:
We go from y to z 100 times, but never go from z to the waterhole:
Q(y,b) = 0
Eventually, in state z, we take action c:
Q(z,c) = r      [2]
At this point this is the only non-zero Q-value in the whole system.
Next time we take action b to go from y to z, we factor in the (best) Q-value of the next state:
Q(y,b) = 0 + γ ( r ) = γ r      [1]
Next time we go from x to y, this value is propagated back:
Q(x,a) = 0 + γ ( γ r ) = γ^2 r      [1]
If r is the only non-zero reward in the system, then even 100 steps away from the waterhole the agent will take the action heading towards the waterhole, since:
γ^100 r > 0      [3]
Q-value = 0 + γ 0 + γ^2 0 + ... + γ^100 r = γ^100 r
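A quick numeric check of this claim (γ = 0.9 and r = 1 are assumed values, not from the notes):

# How much reward survives 100 discount steps? gamma and r are assumed for illustration.
gamma = 0.9
r = 1.0

q_towards_waterhole = (gamma ** 100) * r
print(q_towards_waterhole)        # about 2.66e-5: tiny, but strictly positive
print(q_towards_waterhole > 0)    # True - so this action still beats the all-zero alternatives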
In this way, the agent can learn chains of actions. It looks like it can see into the future - though in reality it lives only in the present (it cannot even predict what the next state will be). Using the language of ethology, this is how credit is propagated from a consummatory action back through the chain of appetitive actions that preceded it.
Q-learning solves the temporal credit assignment problem without having to keep a memory of the last few chains of states. The price you pay is that it demands repeated visits.
Here's how it works precisely. We try b repeatedly in y and it leads to r=0 and new state z. As of now, z is of no interest, so Q(y,b) stays at 0; the learning rate is α(y,b) = 1/n on the nth visit. The updates for Q(y,b) go:
Q(y,b) := 0 (0) + 1 ( 0 + γ 0 ) = 0
Q(y,b) := 1/2 (0) + 1/2 ( 0 + γ 0 ) = 0
Q(y,b) := 2/3 (0) + 1/3 ( 0 + γ 0 ) = 0
...
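Each of these updates is an instance of the general rule Q(s,a) := (1-α) Q(s,a) + α ( r + γ max_b Q(s',b) ) with α = 1/n. A minimal sketch in code, with names of my own choosing rather than anything from the notes:

# One Q-learning update with learning rate alpha = 1/n, where n counts visits to (state, action).
# Function and variable names are illustrative, not from the original notes.

def q_update(Q, visits, state, action, reward, next_state, gamma):
    """Apply Q(s,a) := (1 - alpha) Q(s,a) + alpha * (r + gamma * max_b Q(s',b)), alpha = 1/n."""
    visits[(state, action)] = visits.get((state, action), 0) + 1
    alpha = 1.0 / visits[(state, action)]
    best_next = max([q for (s, _), q in Q.items() if s == next_state], default=0.0)
    Q[(state, action)] = ((1 - alpha) * Q.get((state, action), 0.0)
                          + alpha * (reward + gamma * best_next))
    return Q[(state, action)]

While z leads nowhere interesting, every sample is 0 + γ 0 and the estimate stays at 0, exactly as in the sequence above.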
Eventually, some time in state z, we try action c and get a reward. We get this reward the very first time we try (z,c).
The updates for Q(z,c) go:
Q(z,c) := 0 (0) + 1 ( r + γ 0 ) = r      [2]
Next time we try (y,b), again we don't get a reward, but now it does lead somewhere interesting - z now has a non-zero Q-value:
Q(y,b) := (n-1)/n (0) + 1/n ( 0 + γ r ) = 1/n γ r
...
Q. How about Q(x,a) now? What is the update the next time we try (x,a)?
Let's carry on.
Next time you update Q(y,b):
Q(y,b) := n/(n+1) (1/n γ r) + 1/(n+1) ( 0 + γ r ) = 2/(n+1) γ r
and again:
Q(y,b) := (n+1)/(n+2) (2/(n+1) γ r) + 1/(n+2) ( 0 + γ r ) = 3/(n+2) γ r
In general, after t such updates:
Q(y,b) = t / (t+n-1) γ r
where (n-1) is constant - the number of times we scored Q(y,b) before we discovered z was good.
As t -> infinity, Q(y,b) -> γ r.
Conclusion - It doesn't matter in the long run that we updated Q(y,b) any finite number of times before learning about z.
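A quick numeric check of this conclusion, with assumed values γ = 0.9, r = 1 and 100 earlier visits during which z looked worthless:

# Check: with alpha = 1/n, Q(y,b) still converges to gamma * r no matter how many
# zero-reward updates came first. gamma, r and the visit counts are assumed values.
gamma = 0.9
r = 1.0

q_yb = 0.0
visits = 100                      # 100 early visits: z looked worthless, Q(y,b) stayed at 0

for t in range(1_000_000):        # from now on every visit sees the non-zero Q(z,c) = r
    visits += 1
    alpha = 1.0 / visits
    q_yb = (1 - alpha) * q_yb + alpha * (0 + gamma * r)

print(q_yb)                       # ~0.8999..., creeping up to gamma * r = 0.9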
Q. Prove Q(x,a) -> γ^2 r
If z is not an absorbing state, what might Q(z,c) really look like? Consider these two possibilities:
One answer to all this: whatever Q(z,c) converges to (call it Q),
Q(y,b) -> γ Q
Q(x,a) -> γ^2 Q
etc., as before.
Actually, 100 steps from the waterhole the comparison is not really:
γ^100 r > 0
it will be:
γ^100 r > γ^102 r
(If you don't move towards the waterhole, you move away, but then you can move back in 1 more step.) (*)
(*) Assuming you can move back, i.e. this is two-way, not one-way.
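As a worked check of this sharper comparison (γ = 0.9 and r = 1 are assumed values):

# Good action: 100 discount steps to the waterhole.
# Bad action: step away, step back, then the same 100 steps - 102 discount steps in total.
# gamma and r are assumed for illustration.
gamma = 0.9
r = 1.0

good = (gamma ** 100) * r
bad = (gamma ** 102) * r

print(good, bad)                  # ~2.66e-5 vs ~2.15e-5
print(good > bad)                 # True, but the margin is only a factor of 1/gamma^2, about 1.23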
Normally (depending on the MDP problem, of course) the good action doesn't have a Q-value massively bigger than the bad actions, only a bit bigger. This is because Q-learning measures you making the wrong choice now and then the right choice afterwards (rather than measuring you making the wrong choice, followed by the wrong choice, etc.).
Q-value of stupid action = r_stupid + γ max_b Q(y,b)
(where the max then picks out the clever action).
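For instance (all numbers here are assumed for illustration, not from the notes):

# Numeric instance of: Q-value of stupid action = r_stupid + gamma * max_b Q(y,b).
gamma = 0.9

q_clever = 10.0                   # Q-value of the good action from the current state
r_stupid = 0.0                    # the stupid action itself earns nothing...
best_after_stupid = 9.0           # ...and leads one step off the good path, from where
                                  # the best (clever) continuation is worth gamma * 10 = 9

q_stupid = r_stupid + gamma * best_after_stupid
print(q_clever, q_stupid)         # 10.0 vs 8.1 - worse, but not massively worse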
Q-learning pushes the limits of stimulus-response behaviour. By associating x with y (and y with z and so on) it shows how an apparently "stimulus-response" agent can behave as if it has a past and future.
The whole trend of modern behaviour-based AI is to push the limits of what stimulus-response can do before postulating more complex models (though we all agree that more complex models are needed eventually). See Discussion of Stimulus-Response.