RL - The world

A Reinforcement Learning (RL) agent senses a world, takes actions in it, and receives numeric rewards and punishments from some reward function based on the consequences of the actions it takes. By trial-and-error, the agent learns to take the actions which maximize its rewards.

The basic algorithm of the agent:


we build up estimates of Q(x,a) = how good action a is in state x
start with Q(x,a) random or zero

do for a long time
{
  observe x
  suggest a according to some algorithm
  execute a
  observe y
  receive reward r(y) or r(x,y)
  modify Q(x,a) according to some algorithm
}

a can be "no-op".

Note that having a reward every time step is actually no restriction. The classic delayed reinforcement problem is just a special case with, say, most rewards zero, and only a small number of infrequent rewards non-zero.

Sample definitions of x

Clocksin and Moore
House Robot
Ant World
Tyrrell's world
Chess board
- Scheme 1: x = (x1,x2,..,x64) where each x_i = what?
- Scheme 2: x = (x1,x2,..,x32) where each x_i = what?
- Problem for scheme 2: Promotion allows for more than 32 pieces (e.g. two queens).
- Problem for schemes 1 and 2: Castling and En passant are only allowed if certain previous conditions hold, so this information must be in x.
- See Forsyth-Edwards Notation

Markov Decision Process (MDP)

A simple way of modelling an uncertain world is to model it as a Markov Decision Process (MDP).

The agent observes discrete states of the world x ( , a finite set) and can execute discrete actions a ( , a finite set). Each discrete time step, the agent observes state x, takes action a, observes new state y, and receives immediate reward r.

Transitions are probabilistic, that is, y and r are drawn from stationary probability distributions and , where is the probability that taking action a in state x will lead to state y and is the probability that taking action a in state x will generate reward r. We have and .

Probabilistic Transitions

Start in state x:

In state x,
taking action b gives a guaranteed reward of 1.5.
Taking action a normally leads to a reward of only 1, but sometimes leads to a reward of 5.

P_xa(y) = 0.8
P_xa(z) = 0.2
P_xa(w) = 0 for all w other than y,z
(probabilistic transition)

P_xb(s) = 1
P_xb(w) = 0 for all w other than s
(deterministic transition)

Markov and stationary

The transition probability can be viewed as a conditional probability. Where

are the values of x,a at time t, the Markov property is expressed as:

and the stationary property is expressed as:

World may just look probabilistic

Deterministic world may just look probabilistic because of poor senses.
e.g. Imagine we sense direction but not distance.
Features of the world that are deterministic can actually appear probabilistic to the agent.
This is illustrated in Figure. When in the situation on the left, the agent will experience:

. When in the situation on the right, the agent will experience:

Probabilistic transitions .

Non-Markovian worlds

In fact the world above is not exactly an MDP.
Consider:
P ( x_t+1=bump | x_t=s_t, a_t=b_t, x_t-1=s_t-1, a_t-1=b_t-1, ... )

P ( bump | wall is N, move N, wall is N, move N, wall is N, move N )
is higher than:
P ( bump | wall is N, move N, wall not visible, move N, wall not visible, move N )

i.e. There isn't simply a constant:
P ( bump | wall is N, move N )

Non-stationary probabilistic transitions

. The states' positions in the diagram correspond to their geographical positions. x is the state "food is EastNorthEast". y is the state "food is East". The agent sees both y1 and y2 as the same state y. The quantities on the arrows are the rewards for that transition.

leads to state y, and also leads to state y. But the probability of leading to an immediate reward depends on the agent's policy. If it tends to take action , the value of y will be higher than if it tends to take action . But it will not necessarily learn to take action as a result - that would just be normal learning, where it can tell the states apart. is just as likely a policy because is also rewarded when the value of y increases.

is non-stationary. In fact, the world is what is called a Partially-Observable MDP.

Q. What simple change would turn the above into a MDP (in fact, a deterministic MDP)?

Deterministic worlds

The simple deterministic world is a special case with all transition probabilities equal to 1 or 0. For any pair x,a, there will be a unique state

and a unique reward

such that:

displaymath6387

Deterministic and probabilistic worlds

Consider:

Deterministic world that looks deterministic.
Deterministic world that looks probabilistic.
Probabilistic world that looks probabilistic.
Probabilistic world that looks deterministic.

Kaelbling survey
- non-Markov worlds
Examples:
- Chess - Apparently deterministic - given x, take a, see predictable next state y. However, y is the state your opponent gets to take an action in, not you. He takes unpredictable action b, leading to state z. So from your point of view, you see x, take a, wait for a bit, and then see unpredictable z. From your point of view, world is probabilistic (as is any world with an opponent).
- Football - Non-Markov world. Don't know what's in other players' minds, so don't know if set-piece free kick will work this time (as it did last time).

All adversarial games (chess, Xs and Os):

Observe x
Take a
(Opponent makes non-deterministic move)
Observe y

Can non-Markov worlds be solved?
- Some can - e.g. Transition depends on this state plus previous state only.
- Some can't - e.g. World where transition probabilities change randomly. Consequences of all actions are random.
Is the universe deterministic?
- Determinism
  - Determinism, quantum mechanics and classical physics
- Scientific determinism
- Interpretation of quantum mechanics
- Quantum indeterminacy
- Hidden variable theory - The (very much minority) theory that the universe could be deterministic below the quantum level, but appear probabilistic at the quantum level.

Action = "no-op"

The set of actions available may differ from state to state. Depending on what we mean by this, this may also possibly be represented within the model, as the special case of an action that when executed (in this particular state, not in general) does nothing:

for all actions a in the "unavailable" set for x .

Note this would not exactly be "do nothing" since you would still get the reward for staying in x. It could therefore be a useful action!

Expected Reward

In this:

How do we decide which action to take?
Look at expected reward:
E_a(r) = 0.2 (5) + 0.8 (1)
= 1.8
E_b(r) = 1 (1.5)
= 1.5
Action a is better.

What do the probabilities mean?
They mean that if you take a 100 times, you expect 20 5's and 80 1's. Total reward 180.
If you take b 100 times you expect 100 1.5's. Total reward 150.

Maximise E(r).

When we take action a in state x, the reward we expect to receive is:

r_min ≤ E(r) ≤ r_max

Rewards r are bounded by

, where

(

would be a system where the reward was the same no matter what action was taken. The agent would always behave randomly and would be of no use or interest). Hence for a given x,a,

E(r) = p₁ r₁ + ... + p_n r_n
≥ p₁ r_min + ... + p_n r_min
= 1 . r_min

E(r) exists, E(y) doesn't exist

Imagine a state definition for a robot carrying blocks to put in a bin:
x = (x1,x2)
x1 = what you are carrying: 0=not carrying block, 1=carrying block
x2 = direction of bin, values 0-9

You are in state x = (1,5).
Imagine that if you take some action a, you go to state y with probabilities:


y               Pxa(y)          reward

(1,8)           0.4             10
(0,8)           0.3             5
(1,5)           0.2             1
(0,5)           0.1             0

You can calculate E(r).
Q. What is it?

But "E(y)" is meaningless.
Consider the sum of the y's times their probabilities
= (1,8) 0.4 + (0,8) 0.3 + (1,5) 0.2 + (0,5) 0.1
= (0.6, 7.1)

You are never in this state when you take a.
In fact, this state doesn't exist.
"0.6" doesn't have a meaning.
The state space is not a continuous space. It is a discrete space of points.

r(x,y)

Typically, we don't reward for the correct action, r = r(x,a):

in state x
if took action b
  then reward 1
else
  reward 0

because we don't know the correct action! That's why we're learning.
Instead we reward on getting to some state (somehow, we don't know how), r = r(x,y):

in state x
if got to state z
  then reward 1
else
  reward 0

Writing r = r(x,y), the probability of a particular reward r is:

and the expected reward becomes:

That is, normally we never think in terms of - we only think in terms of .

e.g. In the following:

Taking action a in state x,
P(r=5) = 0.2
P(r=0) = 0.4
P(r=3) = 0.3 + 0.1
= 0.4
Question - What is E(r)?

Reward on Continuous or Transition?

In some models, rewards are associated with states, rather than with transitions, that is, r = r(y). The agent is not just rewarded for arriving at state y - it is also rewarded continually for remaining in state y. This is obviously just a special case of r = r(x,y) with:

and hence:

Reward on transition to good state only: r = r(x,y).
e.g. x = not at waterhole, y = at waterhole. Problem - may leave state to come back in!

Reward for arriving and staying in good state: r = r(y).

Movie demo of AntWorld
See discussion on Rewarding on continuous or transition.