Dr. Mark Humphrys

School of Computing. Dublin City University.

Online coding site: Ancient Brain

coders   JavaScript worlds

Search:


Research - PhD - Chapter 13 - Chapter 14



14 Summary

14.1 The four approaches

There are four basic approaches to organising action selection without reference to a global reward. When the agents share the same set of actions, we can calculate all four approaches immediately. They are:

There are of course other combinations of maximizing, minimizing and summing which do not make any sense. For the full list see §F.

The first approach above has a counterpart discussed in §9. The second approach has a counterpart discussed in §13. The third and fourth approaches are counterparts of each other.





General comparisons between the action selection methods.

A dash indicates "not applicable" here.

"Number of updates per timestep" is modelled on Watkins [Watkins, 1989] where he wanted to impose limited computational demands on his creature per timestep. The format here is (number of agents) times (updates per agent). We are looking for updates that can be carried out in parallel in isolation.

In the "Ability to discriminate states" column, "Full" indicates complete ability to discern the full state. "Partial" indicates that the ability to discern the full state depends on the generalisation. "Poor" indicates that agents see subspaces only.

Are there any missing methods here? How about W=Q with full space? That would be pointless since that's just Q with full space. The agent will just learn the same Q-values repeated. Similarly, Maximize Collective Happiness with full space is just Q with full space. The W-learning methods are approximations of the approach to which they belong when actions are not shared. In which case, the practical solution is to try out being disobeyed and observe the Unhappiness, as opposed to expanding our action set and learning new Q-values. There is no equivalent of a W-learning method for the Happiness approaches.





Comparisons between the methods as applied in the House Robot problem.

A dash indicates "not applicable" here.

Full details of the experiments, and in particular how we end up comparing a single score for each method, are in §G. For discussion of the spread of scores for each method see §16.3.

Here we show the Number of updates per timestep per agent if the system is totally parallelised (each agent has its own processor). Remember here n = 8 .

Actually for testing Hierarchical Q-learning, I used n = 5 to reduce the size of the Q(x,i) space. The memory requirements shown here are for n = 8 .





Restrictions on Decentralisation.

"No" is good here.

There is "no free lunch" in decentralisation. If we want a totally decentralised model, with separate state and action spaces, we have to use either a static method like W=Q, with its inability to support opportunism, or W-learning with subspaces, with its inaccurate competition. In a world where opportunism was more important, W-learning with subspaces wouldn't necessarily be worse than W=Q, as it was here.





14.2 Winner-take-all v. Compromise actions

The action selection methods can be categorised based on the action they choose:

Note that none of these are schemes where the actions of agents are merged. Merging or averaging actions does not make sense in general but only with certain types of actions. For example, see [Scheier and Pfeifer, 1995], where all agents suggest a speed value for the left motor and one for the right motor. These type of actions are amenable to addition and subtraction.



14.3 Single-Mindedness

We will attempt to summarise both this chapter and the preceding discussion in one diagram.

Collective methods trample on the individual by trying to keep the majority happy. They are likely both to ruin some individual plans, while at the same time leading to problems with dithering, in which no goal is pursued to its logical end.

W=Q goes to the other extreme in having only one agent in charge, and perhaps suffers because it does not allow opportunism. It won't compromise with the second most needy agent.

W-learning may be a good balance between the two, allowing opportunism without the worst of dithering. One agent is generally in charge, but will be corrected by the other agents whenever it offends them too much. W-learning tries to keep everyone on board, while still going somewhere. For a further analysis of how Minimize the Worst Unhappiness gets tasks completed (enforces persistence) see §15.1.3.

[Maes, 1989, §1] lists desirable criteria for action selection schemes, and in it we see this tension between wanting actions that contribute to several goals at once and yet wanting to stick at goals until their conclusion. We can represent this in a diagram of "single-mindedness" (Figure 14.1).


Figure 14.1: The "single-mindedness" of the methods that organise action selection without reference to a global reward.



Chapter 15

Return to Contents page.



ancientbrain.com      w2mind.org      humphrysfamilytree.com

On the Internet since 1987.      New 250 G VPS server.

Note: Links on this site to user-generated content like Wikipedia are highlighted in red as possibly unreliable. My view is that such links are highly useful but flawed.