The agent acts according to a policy which tells it what action to take in each state x. A policy that specifies a unique action to be performed is called a deterministic policy - as opposed to a stochastic policy, where an action a is chosen from a distribution with probability π(x,a).
a        a1    a2    a3
Q(x,a)   0     5.1   5

Deterministic policy - When in x, always take action a2.
Stochastic policy - When in x, 80% of the time take a2, 20% of the time take a3.
Typically - Start stochastic. Become more deterministic as we learn.
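As a rough sketch of what this looks like in code (the function and variable names here are invented for illustration, not part of the notes), a stochastic policy over a small action set can be sampled like this in JavaScript:

// Sketch: pick an action from a stochastic policy.
// "probs" maps action names to probabilities summing to 1,
// e.g. the 80% / 20% policy above.
function sampleAction ( probs )
{
 var r = Math.random();                    // uniform in [0,1)
 var total = 0;
 for ( var a in probs )
 {
  total = total + probs[a];
  if ( r < total ) return a;
 }
 return Object.keys(probs).pop();          // guard against rounding error
}

var stochasticPolicy    = { "a2": 0.8, "a3": 0.2 };   // when in state x
var deterministicPolicy = { "a2": 1.0 };              // always take a2

console.log ( sampleAction ( stochasticPolicy ) );

Making the distribution sharper over time (more and more weight on the best-looking action) is one way of "starting stochastic and becoming more deterministic".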
Following a stationary deterministic policy π, at time t, the agent observes state x_t, takes action a_t = π(x_t), observes new state y_t, and receives reward r_t with expected value E(r_t).
Do we go for short-term gain or long-term gain?
Can we take a short-term loss for long-term gain?
Instead of deciding this once for all problems, we make it a parameter that we can adjust.
Not just maximise immediate E(r) (which would merely deal with the fact that rewards are probabilistic).
Could maximise long-term sum of expected rewards. This may involve taking a short-term loss for long-term gain (equivalent to allowing a move downhill in hill-climbing).
The discounting factor γ defines how much expected future rewards affect decisions now.
0 ≤ γ < 1
(e.g. γ = 0.9)
If rewards can be positive or negative, the total discounted reward R = r_0 + γ r_1 + γ² r_2 + ... may be a sum of positive and negative terms.
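As a minimal sketch (function name invented for illustration), the discounted sum R for a finite list of rewards, which may mix positive and negative terms, can be computed like this:

// Sketch: R = r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward sequence.
function discountedReturn ( rewards, gamma )
{
 var R = 0;
 var d = 1;                     // current discount factor, gamma^t
 for ( var t = 0; t < rewards.length; t++ )
 {
  R = R + ( d * rewards[t] );
  d = d * gamma;
 }
 return R;
}

// Rewards may be positive or negative:
console.log ( discountedReturn ( [ 2, -1, 3, -1 ], 0.9 ) );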
Special case: γ = 0.
Doesn't work: γ = 1 (see later).
The interesting situation is in the middle: 0 < γ < 1.
Genuine immediate reinforcement learning is the special case γ = 0, where we only try to maximise immediate reward. Low γ means pay little attention to the future. High γ means that potential future rewards have a major influence on decisions now - we may be willing to trade short-term loss for long-term gain. Note that for γ < 1, the value of future rewards always eventually becomes negligible.
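As a quick numeric illustration (these numbers are illustrative, not from the notes): with γ = 0.9, a reward arriving 10 steps away is scaled by 0.9^10 ≈ 0.35, while one arriving 50 steps away is scaled by 0.9^50 ≈ 0.005 - so for any γ < 1, sufficiently distant rewards count for almost nothing.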
r in the future can even override immediate punishment
If we decide what to do based on which path has highest R, then [punishment now, r in the future] can beat [reward now, but not in future] if discounting leaves enough of the future reward. For example, if one path gives punishment −p now followed by reward r after n steps, and the other gives reward r′ now and nothing after, the first path wins whenever −p + γ^n r > r′.
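For instance (illustrative numbers, not from the notes): with γ = 0.9, an immediate punishment of −1 followed by a reward of 10 five steps later is worth −1 + 0.9^5 × 10 ≈ −1 + 5.9 = 4.9, which beats an immediate reward of 2 followed by nothing.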
If the maximum reward is rmax, what is the maximum possible value of R?
The most it could be would be to get the maximum reward every step:
R ≤ rmax + γ rmax + γ² rmax + γ³ rmax + ... = rmax ( 1 + γ + γ² + γ³ + ... )
What is this quantity on the right? Let:
X = 1 + γ + γ² + γ³ + ...
Multiply by γ:
γ X = γ + γ² + γ³ + ...
Subtract the 2 equations:
X − γ X = 1
And we get a value for X:
X = 1 / (1 − γ)
That is:
R ≤ rmax / (1 − γ)
This is just the Negative Binomial Series in the Binomial Theorem:
(x + 1)^(−1) = 1 − x + x² − x³ + x⁴ − ...
Let x = −γ:
(1 − γ)^(−1) = 1 + γ + γ² + γ³ + ...
which confirms the value of X above.
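A quick sketch (illustrative values) checking numerically that the running sum rmax ( 1 + γ + γ² + ... ) does approach rmax / (1 − γ):

// Sketch: sum rmax * gamma^t over many steps and compare to rmax / (1 - gamma).
var gamma = 0.9;
var rmax  = 5;

var sum = 0;
var d   = 1;
for ( var t = 0; t < 200; t++ )
{
 sum = sum + ( rmax * d );
 d   = d * gamma;
}

console.log ( sum );                     // approximately 50
console.log ( rmax / ( 1 - gamma ) );    // exactly 50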
The expected total discounted reward if we follow policy π forever, starting from state x, is:
Vπ(x) = E ( r_0 + γ r_1 + γ² r_2 + ... )
Vπ(x) is called the value of state x under policy π. The problem for the agent is to find an optimal policy - one that maximises the total discounted expected reward (there may be more than one policy that does this).
It is known (not shown here) that in an MDP, there is a stationary and deterministic optimal policy. Briefly, in an MDP one does not need memory to behave optimally, the reason being essentially that the current state contains all information about what is going to happen next. A stationary deterministic optimal policy π* satisfies, for all x:
Vπ*(x) = max over all policies π of Vπ(x)
For any state x, there is a unique value V*(x), which is the best that an agent can do from x. Optimal policies may be non-unique, but the value V*(x) is unique. All optimal policies will have:
Vπ*(x) = V*(x)
This systematic iterative DP process is obviously an off-line process. The agent must cease interacting with the world while it runs through this loop until a satisfactory policy is found.
Note that the initially-random V and Q values appear on the RHS of the update equation, so the updates must be repeated to factor in more accurate values later.
The exception is if γ = 0 (only look at immediate reward). Then you only need one iteration of updates and you are done: you have worked out immediate E(r) completely.
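The update equations of the DP process are not written out in this excerpt, so the following is only a sketch of the kind of loop being described, assuming the standard value-iteration form with a known model (all names here - P, ER, dpSweep - are invented for illustration):

// Sketch of one sweep of offline DP (standard value iteration).
// Assumed model, with invented names:
//   P[x][a][y] = Pxa(y), probability that action a in state x leads to state y
//   ER[x][a]   = expected immediate reward E(r) for action a in state x
// V and Q start off arbitrary and are refined by repeated sweeps.
function dpSweep ( states, actions, P, ER, gamma, V, Q )
{
 for ( var i = 0; i < states.length; i++ )
 {
  var x = states[i];
  var best = -Infinity;
  for ( var j = 0; j < actions.length; j++ )
  {
   var a = actions[j];
   var future = 0;
   for ( var k = 0; k < states.length; k++ )
   {
    var y = states[k];
    future = future + ( P[x][a][y] * V[y] );
   }
   Q[x][a] = ER[x][a] + ( gamma * future );   // current (possibly inaccurate) V on the RHS
   if ( Q[x][a] > best ) best = Q[x][a];
  }
  V[x] = best;
 }
}

// Repeat dpSweep until V stops changing much. If gamma = 0 the "future" term
// vanishes, so a single sweep already gives Q(x,a) = E(r), as noted above.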
Fortunately, we can still learn from this. In the DP process above, the updates of V(x) can in fact occur in any order, provided that each x will be visited infinitely often over an infinite run. So we can interleave updates with acting in the world, having a modest amount of computation per timestep. In Q-learning we cannot update directly from the transition probabilities - we can only update from individual experiences.
In the real world, we may not know how to get back to state x to try another action:
x = "I am at party"
a = "I say chat-up line c1"
result - punishment
I might want to say "Go back to x, try c2" but I may not know how to get back to x. All I can do is try out actions in the states in which I happen to find myself. I have to wait until state x happens to present itself again.
Q-learning is for when you don't know Pxa(y) and Pxa(r).
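The update rule itself is not written out in this excerpt; as a sketch, the standard Q-learning update from one individual experience (state x, action a, reward r, new state y) looks like this, where the learning rate alpha is an extra parameter not discussed above:

// Sketch: standard Q-learning update from a single experience (x, a, r, y).
// No Pxa(y) or Pxa(r) is needed. Assumes Q[y] already has an entry
// for every action available in state y.
function qUpdate ( Q, x, a, r, y, gamma, alpha )
{
 var best = -Infinity;                  // max over b of Q(y,b)
 for ( var b in Q[y] )
  if ( Q[y][b] > best ) best = Q[y][b];

 Q[x][a] = Q[x][a] + alpha * ( r + gamma * best - Q[x][a] );
}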
But one approach is just: interact with the world a large number of times to build up a table of estimates for Pxa(y) and Pxa(r). Then just do offline DP.
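A sketch (invented names) of that counting approach: tally what the world does after each (state, action) pair, turn the counts into estimated probabilities, and then hand them to the offline DP above:

// Sketch: estimate Pxa(y) from counts gathered while acting in the world.
// counts[x][a][y] = number of times action a in state x led to state y.
// A similar table of reward counts would give an estimate of Pxa(r).
function estimatePxa ( counts, x, a )
{
 var total = 0;
 for ( var y in counts[x][a] ) total = total + counts[x][a][y];

 var P = {};
 for ( var y in counts[x][a] ) P[y] = counts[x][a][y] / total;
 return P;                              // estimated Pxa(.), ready for offline DP
}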