Dr. Mark Humphrys

School of Computing. Dublin City University.


2 Reinforcement Learning

A Reinforcement Learning (RL) agent senses a world, takes actions in it, and receives numeric rewards and punishments from some reward function based on the consequences of the actions it takes. By trial and error, the agent learns to take the actions which maximize its rewards. For a broad survey of the field see [Kaelbling et al., 1996].

2.1 Q-learning

Watkins [Watkins, 1989] introduced the method of reinforcement learning called Q-learning. The agent exists within a world that can be modelled as a Markov Decision Process (MDP). It observes discrete states of the world x (x ∈ X, a finite set) and can execute discrete actions a (a ∈ A, a finite set). Each discrete time step, the agent observes state x, takes action a, observes new state y, and receives immediate reward r. Transitions are probabilistic, that is, y and r are drawn from stationary probability distributions P_xy(a) and P_xr(a), where P_xy(a) is the probability that taking action a in state x will lead to state y and P_xr(a) is the probability that taking action a in state x will generate reward r. We have Σ_y P_xy(a) = 1 and Σ_r P_xr(a) = 1.
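This model can be written down directly. The following Python sketch (a toy two-state world; all state names, actions, and probabilities are invented for illustration, not taken from the thesis) stores the distributions P_xy(a) and P_xr(a) as tables and samples one time step:

```python
import random

# Toy two-state MDP (all names and numbers invented for illustration).
# P[(x, a)] is the distribution P_xy(a) over next states y;
# R[(x, a)] is the distribution P_xr(a) over rewards r.
P = {
    ("x0", "a0"): {"x0": 0.2, "x1": 0.8},
    ("x0", "a1"): {"x0": 1.0},
    ("x1", "a0"): {"x0": 0.5, "x1": 0.5},
    ("x1", "a1"): {"x1": 1.0},
}
R = {
    ("x0", "a0"): {0.0: 0.9, 1.0: 0.1},
    ("x0", "a1"): {0.0: 1.0},
    ("x1", "a0"): {1.0: 1.0},
    ("x1", "a1"): {0.0: 1.0},
}

def step(x, a, rng=random):
    """One discrete time step: sample new state y and reward r from the
    stationary distributions for the pair (x, a)."""
    ys, py = zip(*P[(x, a)].items())
    rs, pr = zip(*R[(x, a)].items())
    return rng.choices(ys, weights=py)[0], rng.choices(rs, weights=pr)[0]
```

Note that each row of both tables sums to 1, matching the two normalisation conditions above.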

Note that having a reward every time step is actually no restriction. The classic delayed reinforcement problem is just a special case with, say, most rewards zero, and only a small number of infrequent rewards non-zero.

The transition probability can be viewed as a conditional probability. Where x_t, a_t are the values of x, a at time t, the Markov property is expressed as:

    Pr( x_{t+1} = y | x_t, a_t, x_{t-1}, a_{t-1}, ..., x_0, a_0 ) = Pr( x_{t+1} = y | x_t, a_t )

and the stationary property is expressed as:

    Pr( x_{t+1} = y | x_t = x, a_t = a ) = P_xy(a)   for all t

2.1.1 Special case - Deterministic worlds

The simple deterministic world is a special case with all transition probabilities equal to 1 or 0. For any pair x, a, there will be a unique state y_xa and a unique reward r_xa such that:

    P_{x,y_xa}(a) = 1 ,   P_{x,r_xa}(a) = 1

2.1.2 Special case - Different action sets

The set of actions available may differ from state to state. Depending on what we mean by this, it may also be representable within the model, as the special case of an action that, when executed (in this particular state, not in general), does nothing:

    P_xx(a) = 1

for all actions a in the "unavailable" set for x.

This would not exactly be "do nothing" since you would still get the reward for staying in x.

Later in this thesis, we will consider how agents react when other agents take actions that this agent does not recognise. Here, by contrast, the action is recognised by the agent; it simply becomes unavailable in some states.

In the multi-agent case, an agent could assume that an unrecognised action has the effect of "do nothing", but it might be unwise to assume this without observing.

2.1.3 Notes on expected reward

When we take action a in state x, the reward we expect to receive is:

    E(r) = Σ_r  r P_xr(a)

Typically, the reward function is not phrased in terms of what action we took but rather what new state we arrived at, that is, r is a function of the transition x to y. The general idea is that we can't reward the right a because we don't know what it is - that's why we're learning. If we knew the correct a we could use ordinary supervised learning.

Writing r = r(x,y), the probability of a particular reward r is:

    Pr(r) = Σ_{y : r(x,y) = r}  P_xy(a)

and the expected reward becomes:

    E(r) = Σ_y  P_xy(a) r(x,y)

That is, normally we never think in terms of P_xr(a) - we only think in terms of P_xy(a). Kaelbling [Kaelbling, 1993] defines a globally consistent world as one in which, for a given x, a, E(r) is constant. This is equivalent to requiring that P_xr(a) is stationary (hence, with the typical type of reward function, P_xy(a) stationary).

In some models, rewards are associated with states, rather than with transitions, that is, r = r(y). The agent is not just rewarded for arriving at state y - it is also rewarded continually for remaining in state y. This is obviously just a special case of r = r(x,y) with:

    r(x,y) = r(y)   for all x

and hence:

    E(r) = Σ_y  P_xy(a) r(y)

Rewards r are bounded by |r| ≤ R, where R > 0 (R = 0 would be a system where the reward was the same no matter what action was taken. The agent would always behave randomly and would be of no use or interest). Hence for a given x, a, |E(r)| ≤ R.
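Spelling out that final bound (writing P_xr(a) for the probability that action a in state x generates reward r, and R for the bound on rewards), it follows in one line:

```latex
\bigl| E(r) \bigr|
  = \Bigl| \sum_r r \, P_{xr}(a) \Bigr|
  \le \sum_r |r| \, P_{xr}(a)
  \le R \sum_r P_{xr}(a)
  = R
```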

2.1.4 The task

The agent acts according to a policy π which tells it what action to take in each state x. A policy that specifies a unique action a = π(x) to be performed is called a deterministic policy - as opposed to a stochastic policy, where in state x an action a is chosen from a distribution with probability π(x,a).

A policy that has no concept of time is called a stationary or memory-less policy - as opposed to a non-stationary policy (e.g. "do actions a and b alternately"). A non-stationary policy requires the agent to possess memory. Note that stochastic does not imply non-stationary.

Following a stationary deterministic policy π, at time t, the agent observes state x_t, takes action a_t = π(x_t), observes new state x_{t+1}, and receives reward r_t with expected value:

    E(r_t) = Σ_r  r P_{x_t,r}(a_t)

The agent is interested not just in immediate rewards, but in the total discounted reward. In this measure, rewards received n steps into the future are worth less than rewards received now, by a factor of γ^n, where 0 ≤ γ < 1:

    r_t + γ r_{t+1} + γ² r_{t+2} + ...

The discounting factor γ defines how much expected future rewards affect decisions now. Genuine immediate reinforcement learning is the special case γ = 0, where we only try to maximize immediate reward. Low γ means pay little attention to the future. High γ means that potential future rewards have a major influence on decisions now - we may be willing to trade short-term loss for long-term gain. Note that for γ < 1, the value of future rewards always eventually becomes negligible. The expected total discounted reward if we follow policy π forever, starting from x_t = x, is:

    V_π(x) = E ( r_t + γ r_{t+1} + γ² r_{t+2} + ... )

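As a quick numerical illustration (the reward sequence here is invented), the total discounted reward of a finite run can be computed directly, and γ = 0 recovers the immediate-reward case:

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    total = 0.0
    for n, r in enumerate(rewards):
        total += (gamma ** n) * r
    return total

# A reward received three steps from now is discounted by gamma^3:
# discounted_return([0, 0, 0, 1], 0.9) is 0.9**3, about 0.729.
# With gamma = 0, only the immediate reward counts.
```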
V_π(x) is called the value of state x under policy π. The problem for the agent is to find an optimal policy - one that maximizes the total discounted expected reward (there may be more than one policy that does this). Dynamic Programming (DP) theory [Ross, 1983] assures us of the existence of an optimal policy for an MDP, and also (perhaps surprisingly) that there is a stationary and deterministic optimal policy. In an MDP one does not need memory to behave optimally, essentially because the current state contains all information about what is going to happen next. A stationary deterministic optimal policy π* satisfies, for all x:

    V_{π*}(x) = max_π V_π(x)

For any state x, there is a unique value V*(x), which is the best that an agent can do from x. Optimal policies π* may be non-unique, but the value V*(x) is unique. All optimal policies π* will have:

    V_{π*}(x) = V*(x)   for all x

2.1.5 The strategy

The strategy that the Q-learning agent adopts is to build up Quality-values (Q-values) Q(x,a) for each pair (x,a). If the transition probabilities P_xy(a), P_xr(a) are explicitly known, Dynamic Programming finds an optimal policy by starting with random V(x) and random Q(x,a) and repeating forever (or at least until the policy is considered good enough):

    for each pair (x,a):
        Q(x,a) := Σ_r r P_xr(a)  +  γ Σ_y P_xy(a) V(y)
    for each state x:
        V(x) := max_a Q(x,a)

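The iterative loop can be sketched in code for an invented two-state world with known transition probabilities and expected rewards (all numbers are illustrative, not from the thesis):

```python
# Toy two-state world with known transition probabilities: action 1
# taken in state 1 yields reward 1 and stays there; all else is 0.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
P = {  # P[(x, a)] maps next state y to P_xy(a)
    (0, 0): {0: 1.0}, (0, 1): {1: 1.0},
    (1, 0): {0: 1.0}, (1, 1): {1: 1.0},
}
ER = {  # ER[(x, a)] is the expected immediate reward E(r) for (x, a)
    (0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0,
}

V = {x: 0.0 for x in STATES}   # could equally start with random values
Q = {(x, a): 0.0 for x in STATES for a in ACTIONS}

for _ in range(200):           # "repeat until the policy is good enough"
    for x in STATES:
        for a in ACTIONS:
            Q[(x, a)] = ER[(x, a)] + GAMMA * sum(
                p * V[y] for y, p in P[(x, a)].items())
        V[x] = max(Q[(x, a)] for a in ACTIONS)

policy = {x: max(ACTIONS, key=lambda a: Q[(x, a)]) for x in STATES}
# Staying in state 1 forever is worth 1/(1 - 0.9) = 10, so V[1] -> 10
# and V[0] -> 0.9 * 10 = 9; the policy heads for state 1 from both states.
```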
This systematic iterative DP process is obviously an off-line process. The agent must cease interacting with the world while it runs through this loop until a satisfactory policy is found.

The problem for the Q-learning agent is to find an optimal policy when P_xy(a), P_xr(a) are initially unknown (i.e. we are assuming when modelling the world that some such transition probabilities exist). The agent must interact with the world to learn these probabilities. Hence we cannot try out states and actions systematically as in Dynamic Programming. We may not know how to return to x to try out a different action (x,b) - we just have to wait until x happens to present itself again. We will experience states and actions rather haphazardly.

Fortunately, we can still learn from this. In the DP process above, the updates of V(x) can in fact occur in any order, provided that each x will be visited infinitely often over an infinite run. So we can interleave updates with acting in the world, having a modest amount of computation per timestep. In Q-learning we cannot update directly from the transition probabilities - we can only update from individual experiences.

Core Q-learning algorithm:

In 1-step Q-learning, after each experience, we observe state y, receive reward r, and update:

    Q(x,a) ← r + γ max_b Q(y,b)

where ← means that the estimate Q(x,a) is adjusted towards the value on the right.

For example, in the discrete case, where we store each Q(x,a) explicitly in lookup tables, we update:

    Q(x,a) := (1 - α) Q(x,a) + α ( r + γ max_b Q(y,b) )

Since the rewards are bounded, it follows that the Q-values are bounded (Theorem B.2). Note that we are updating from the current estimate of Q(y,b) - this too is constantly changing.
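A minimal sketch of this tabular update (the function name and the defaultdict representation are implementation choices, not from the thesis):

```python
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)   # lookup table Q[(x, a)], entries default to 0

def q_update(x, a, r, y, actions, alpha):
    """1-step Q-learning: move Q(x, a) towards r + gamma * max_b Q(y, b).
    The max is taken over our *current estimates* Q(y, b), which are
    themselves still changing."""
    target = r + GAMMA * max(Q[(y, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target
```

With alpha = 1 the old estimate is discarded entirely; smaller alpha blends the new sample into the running estimate.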

2.1.6 How Q-learning works

Figure 2.1 illustrates how Q-learning deals with delayed reinforcement. Let V(x) be the estimated value of x during Q-learning: V(x) = Q(x,a), for some a which leads to some state y. Q(x,a) will be a combination of the immediate reward r_xa plus the estimated value V(y). V(y) itself will be equal to some Q(y,b), where b leads to z. Q(y,b) is a combination of r_yb plus V(z), and so on.

Figure 2.1: How Q-learning works (adapted from a talk by Watkins).

Imagine that r_xa and r_yb are zero, but from state z the agent can take an action c that yields a high reward r_zc (Watkins used the example of an animal arriving at a waterhole). Then Q(z,c) will be scored high, and so V(z) increases. Q(y,b) does not immediately increase, but next time it is updated, the increased V(z) is factored in, discounted by γ. Q(y,b) and hence V(y) increase - y is now valuable because from there you can get to z. Q(x,a) does not immediately increase, but next time it is updated, the increased V(y) is factored in, discounted by γ (hence V(z) is effectively factored in, discounted by γ²). Q(x,a) increases because it led to y, and from there you can get to z, and from there you can get a high reward. In this way the rewards are slowly propagated back through time.
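The waterhole example can be replayed numerically. In this sketch (a hypothetical three-state chain with a single action per state, so V(s) is just that action's Q-value; α = 1 since the world is deterministic), Q(x,a) stays at zero until the reward at z has first propagated back to y:

```python
GAMMA = 0.9

# Toy chain x --a--> y --b--> z, where action c in z stays at z and
# yields reward 1 (the "waterhole"). All names are invented.
Q = {("x", "a"): 0.0, ("y", "b"): 0.0, ("z", "c"): 0.0}
ACT = {"x": "a", "y": "b", "z": "c"}
STEPS = [("x", "a", 0.0, "y"), ("y", "b", 0.0, "z"), ("z", "c", 1.0, "z")]

def sweep():
    """One pass of experience along the chain; alpha = 1 is fine in a
    deterministic world."""
    for (s, a, r, s2) in STEPS:
        Q[(s, a)] = r + GAMMA * Q[(s2, ACT[s2])]

history = []
for _ in range(3):
    sweep()
    history.append(Q[("x", "a")])
# history is [0.0, 0.0, 0.81] (up to float rounding): the reward
# reaches x only on the third sweep, discounted by gamma^2.
```

Each sweep moves the reward exactly one step further back along the chain, which is the propagation described above.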

In this way, the agent can learn chains of actions. It looks like it can see into the future - though in reality it lives only in the present (it cannot even predict what the next state will be). Using the language of ethology [Tyrrell, 1993], this is how credit is propagated from a consummatory action back through the chain of appetitive actions that preceded it.

Q-learning solves the temporal credit assignment problem without having to keep a memory of the last few chains of states. The price you pay is that it demands repeated visits. Weir [Weir, 1984] argues that the notion of state needs to be expanded into a trajectory or chain of states through the statespace, but this example shows that such a complex model is not needed for the purposes of time-based behavior.

2.1.7 Notes on building a world model

As Q-learning progresses, the agent experiences the world's transitions. It does not normally build up an explicit map of P_xr(a) of the form (x,a,r) → P_xr(a). Instead it sums the rewards to build up the probably more useful mapping (x,a) → E(r). In fact this exact mapping itself is not stored, but is only one component of the Q-values, which express more than just immediate reward. If γ = 0 the Q-values express the exact map (x,a) → E(r).

The agent also experiences the state transitions and can if it chooses build up a world model, that is, estimates of P_xy(a). This would be the map (x,a,y) → P_xy(a), whose implementation would demand a much larger statespace. Rewards can be coalesced of course, but states can't. There is no such thing as E(y) (except for the special case of a deterministic world, with its map (x,a) → y).

Moore [Moore, 1990] applies machine learning to robot control using a state-space approach, also beginning without knowledge of the world's transitions. His robot does not receive reinforcement, but rather learns P_xy(a) in situations where the designer can specify what the desired next y should be.

Moore works with a statespace too big for lookup tables. He stores (x,a,y) triples (memories) in a nearest-neighbour generalisation. With probabilistic transitions, the agent may experience (x,a,y) at some times and (x,a,z) at others. There may then be a conflict between finding the needed action (x,y predicts a) and predicting what it will do (x,a predicts not-y). In fact Moore shows that this conflict can arise even in a deterministic world after limited experience (where the exemplars you use for prediction are not exact but are only the nearest neighbours).

Sutton's DYNA architecture [Sutton, 1990, Sutton, 1990a] uses reinforcement learning and builds a model in a deterministic world with a statespace small enough to use lookup tables. It just fills in the map (x,a) → y.

The point that Q-learning illustrates though is that learning an explicit world model is not necessary for the central purpose of learning what actions to take. Q-learning is model-free ([Watkins, 1989, §7] calls this primitive learning). We ignore the question of what x,a actually lead to, and the problem of storage of that large and complex mapping, and just concentrate on scoring how good x,a is.

The agent can perform adaptively in a world without understanding it. All it tries to do is sort out good actions to perform from bad ones. A good introduction to this approach is [Clocksin and Moore, 1989], which uses a state-space approach to control a robot arm. Instead of trying to explicitly solve the real-time inverse dynamics equations for the arm, Clocksin and Moore view this as some generally unknown nonlinear I/O mapping and concentrate on filling in the statespace by trial and error until it is effective. They argue against constructing a mathematical model of the world in advance and hoping that the world will broadly conform to it (with perhaps some parameter adjustment during interaction). But here we could learn the entire world model by interaction. What would be the advantage of doing this?

For a single problem in a given world, a model is of limited use. As [Kaelbling, 1993] points out, in the time it takes to learn a model, the agent can learn a good policy anyway. The real advantage of a model is to avoid having to re-learn everything when either the world or the problem changes.

2.2 Discrete Q-learning

In the discrete case, we store each Q(x,a) explicitly, and update:

    Q(x,a) := (1 - α) Q(x,a) + α ( r + γ max_b Q(y,b) )

for some learning rate α which controls how much weight we give to the reward just experienced, as opposed to the old Q-estimate. We typically start with α = 1, i.e. give full weight to the new experience. As α decreases, the Q-value is building up an average of all experiences, and the odd new unusual experience won't disturb the established Q-value much. As time goes to infinity, α → 0, which would mean no learning at all, with the Q-value fixed.

Remember that because in the short term we have inaccurate Q(y,b) estimates, and because (more importantly) in the long term we have probabilistic transitions anyway, the return from the same (x,a) will vary. See §A.1 to understand how a decreasing α is generally used to take a converging expected value of a distribution.

We use a different α_xa for each pair (x,a), depending on how often we have visited state x and tried action a there. When we are in a new area of state-space (i.e. we have poor knowledge), the learning rate is high. We will have taken very few samples, so the average of them will be highly influenced by the current sample. When we are in a well-visited area of state-space, the learning rate is low. We have taken lots of samples, so the single current sample can't make much difference to the average of all of them (the Q-value).

2.2.1 Convergence

[Watkins and Dayan, 1992] proved that the discrete case of Q-learning will converge to an optimal policy under the following conditions. The learning rate α, where 0 ≤ α ≤ 1, should take decreasing (with each update) successive values α_1, α_2, α_3, ..., such that Σ_{i=1}^∞ α_i = ∞ and Σ_{i=1}^∞ (α_i)² < ∞. The typical scheme (and the one used in this work) is, where n_xa is the number of times Q(x,a) has been visited (counting the current visit):

    α_xa = 1 / n_xa

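This per-pair schedule is easy to implement by counting visits (a sketch; the bookkeeping structure is an implementation choice, not from the thesis):

```python
from collections import defaultdict

visits = defaultdict(int)   # n_xa: how many times (x, a) has been updated

def alpha(x, a):
    """Learning rate for the next update of Q(x, a): 1/n on the n-th
    visit, so the first update gets full weight and the Q-value becomes
    a running average of all samples for that pair."""
    visits[(x, a)] += 1
    return 1.0 / visits[(x, a)]

# The harmonic sequence 1, 1/2, 1/3, ... has an infinite sum while the
# sum of its squares is finite, as the convergence conditions require.
```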
If each pair (x,a) is visited an infinite number of times, then [Watkins and Dayan, 1992] shows that for lookup tables Q-learning converges to a unique set of values Q*(x,a) which define a stationary deterministic optimal policy. Q-learning is asynchronous and sampled - each Q(x,a) is updated one at a time, and the control policy may visit them in any order, so long as it visits them an infinite number of times. Watkins [Watkins, 1989] describes his algorithm as "incremental dynamic programming by a Monte Carlo method".

After convergence, the agent will then maximize its total discounted expected reward if it always takes the action with the highest Q*-value. That is, the optimal policy π* is defined by π*(x) = a* where:

    Q*(x, a*) = max_b Q*(x,b)

and the optimal value of a state is:

    V*(x) = max_b Q*(x,b)

The action a* may be non-unique, but the value max_b Q*(x,b) is unique.

Initial Q-values can be random. The weight (1 - α) given to the old estimate typically starts at 0, wiping out the initial Q-value in the first update (Theorem A.1). Although the max_b Q(y,b) term may factor in an inaccurate initial Q-value from elsewhere, these Q-values will themselves be wiped out by their own first update, and poor initial samples have no effect if in the long run samples are accurate (Theorem A.2). See [Watkins and Dayan, 1992] for the proof that initial values are irrelevant.

Note that the sequence α_i = 1/i satisfies the conditions. The typical α goes from 1 down to 0, but in fact if the conditions hold, then for any t, Σ_{i=t}^∞ α_i = ∞ and Σ_{i=t}^∞ (α_i)² < ∞, so α may start anywhere along the sequence. That is, α may take successive values α_t, α_{t+1}, α_{t+2}, ... (see Theorem A.2).

2.2.2 Discussion

Q-learning is an attractive method of learning because of the simplicity of the computational demands per timestep, and also because of this proof of convergence to a global optimum, avoiding all local optima. One catch though is that the world has to be a Markov Decision Process. As we shall see, even interesting artificial worlds are liable to break the MDP model.

Another catch is that convergence is only guaranteed when using lookup tables, while the statespace may be so large as to make discrete lookup tables, with one cell for every combination of state and action, require impractically large amounts of memory. In large statespaces one will want to use some form of generalisation, but then the convergence proof no longer holds.

Finally, even if the world can be approximated by an MDP, and our generalisation can reasonably approximate lookup tables, the policy that Q-learning finds may be surprising. The optimal solution means maximising the rewards, which may or may not actually solve the problem the designer of the rewards had in mind. The agent may find a policy that maximizes the rewards in unexpected and unwelcome ways (see [Humphrys, 1996, §4.1] for an example). Still, the promise of RL is that designing reward functions will be easier than designing behavior. We will return to the theme of designing reward functions in §4.3.1 and §4.4.3.

Also note that Q-learning's infinite number of (x,a) visits can only be approximated. This is not necessarily a worry because the important thing is not the behavior and Q-values at infinity, but rather that in finite time it quickly gets down to focusing on one or two actions in each state. We usually find that a greedy policy (execute the action with the highest Q-value) is an optimal policy long before the Q-values have converged to their final values.

2.2.3 The control policy

In real life, since we cannot visit each (x,a) an infinite number of times, and then exploit our knowledge, we can only approximate Q-learning. We could do a large finite amount of random exploration, then exploit our knowledge. In fact this is the simple method I use on the smaller Q-learning statespaces in this work.

On the larger-statespace Q-learning problems, random exploration takes too long to focus on the best actions, and also in a generalisation it will cause a huge amount of noise, so instead I use a method that interleaves exploration and exploitation. The idea is to start with high exploration and decrease it to nothing as time goes on, so that after a while we are only exploring (x,a)'s that have worked out at least moderately well before.

The specific control policy used is a standard one in the field and originally comes from [Watkins, 1989] and [Sutton, 1990]. The agent tries out actions probabilistically based on their Q-values, using a Boltzmann or softmax distribution. Given a state x, it tries out action a with probability:

    P(a | x) = e^(Q(x,a)/T) / Σ_b e^(Q(x,b)/T)

Note that e^(Q(x,a)/T) > 0 whether the Q-value is positive or negative, and that Σ_a P(a | x) = 1. The temperature T controls the amount of exploration (the probability of executing actions other than the one with the highest Q-value). If T is high, or if Q-values are all the same, this will pick a random action. If T is low and Q-values are different, it will tend to pick the action with the highest Q-value.
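A sketch of this distribution in Python (the subtraction of max(q) is a standard numerical-stability trick, not part of the definition; it leaves the probabilities unchanged):

```python
import math
import random

def boltzmann(q_values, T):
    """Boltzmann / softmax distribution over actions:
    P(a) = exp(Q(x,a)/T) / sum_b exp(Q(x,b)/T).
    Subtracting max(q) first avoids overflow at low temperatures."""
    m = max(q_values)
    weights = [math.exp((q - m) / T) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def pick_action(q_values, T, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    return rng.choices(range(len(q_values)), weights=boltzmann(q_values, T))[0]
```

High T (or equal Q-values) gives a near-uniform choice; low T concentrates nearly all the probability on the highest-valued action.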

At the start, Q is assumed to be totally inaccurate, so T is high (high exploration), and actions all have a roughly equal chance of being executed. T decreases as time goes on, and it becomes more and more likely to pick among the actions with the higher Q-values, until finally, as we assume Q is converging to Q*, T approaches zero (pure exploitation) and we tend to only pick the action with the highest Q-value:

    as T → 0:   P(a | x) → 1 for a = arg max_b Q(x,b), and P(a | x) → 0 for all other actions

That is, the agent acts according to a stochastic policy which as time goes on becomes closer and closer to a deterministic policy.

I should have mentioned here the possibility of joint winners, in which case the final policy will still be stochastic.

In fact, while I use simple random exploration (no Boltzmann) to learn the Q-values on the small statespaces, when it came to testing the result, instead of using a deterministic strict highest-Q policy, I used a low-temperature Boltzmann distribution. Strict determinism can lead to a creature with brittle behavior. If there are a number of more-or-less equally good actions with very little difference in their Q-values, we should probably rotate around them a little rather than pick the same one every time.

2.2.4 Special case - Absorbing states

An absorbing state x is one for which, for all actions a:

    P_xx(a) = 1

Once in x, the agent can't leave. The problem with this is that it will stop us visiting all (x,a) an infinite number of times. Learning stops for all other states once the absorbing state is entered, unless there is some sort of artificial reset of the experiment.

In a real environment, where x contains information from the senses, it is hard to see how there could be such thing as an absorbing state anyway, since the external world will be constantly changing irrespective of what the agent does. It is hard to see how there could be a situation in which it could be relied on to supply the same sensory data forever.

No life-like artificial world would have absorbing states either (the one I use later certainly does not).
