Dr. Mark Humphrys

School of Computing. Dublin City University.



Movie demo of W-learning in the Ant World problem

In the Ant World problem the creature exists in a toroidal gridworld, populated by static (non-moving) pieces of food and a randomly moving predator. When the creature encounters food, it picks it up. It drops food at the nest. It may only carry one piece at a time. When a piece of food is picked up, another one grows in a random location.



The creature senses things only within a small radius around it, and it senses only the direction (not the distance) of each thing, as follows:


[Figure: the eight directions, numbered 0-7, in which the creature senses nearby objects]


The creature takes actions a, which take values 0-7 (move in that direction) and 8 (stay still).

We consider the creature with 3 "brains" or "agents" in its head. The 3 agents work by Q-learning. They sense different statespaces and operate according to different reward functions.
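
The Q-learning inside each agent is standard. As a rough sketch in Python (the learning rate and discount values are illustrative, not the actual constants used here), each agent keeps a table of Q-values over its own small statespace and updates it from the rewards it generates for itself:

from collections import defaultdict

class QAgent:
    """One "brain": tabular Q-learning over this agent's own view of the world."""
    def __init__(self, n_actions=9, alpha=0.2, gamma=0.9):
        self.Q = defaultdict(float)   # Q[(x, a)] -> estimated value
        self.W = defaultdict(float)   # W[x] -> strength of claim to control (used below)
        self.n_actions = n_actions
        self.alpha = alpha            # learning rate (illustrative value)
        self.gamma = gamma            # discount factor (illustrative value)

    def suggest(self, x):
        """The action this agent would like executed in its state x."""
        return max(range(self.n_actions), key=lambda a: self.Q[(x, a)])

    def q_update(self, x, a, r, y):
        """Move Q(x,a) toward r + gamma * max over b of Q(y,b)."""
        best_next = max(self.Q[(y, b)] for b in range(self.n_actions))
        self.Q[(x, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(x, a)])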

Each timestep, the creature senses a state x, each agent inside the creature suggests an action, some agent Ak wins the internal competition and has its action a executed, then the creature senses a new state y. The caption line of the movies shows each step:

 x [Ak] a -> y 
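
One timestep of this compete-then-act loop might be sketched as follows (observe() and execute() are hypothetical helpers for the gridworld; the winner is simply the agent with the highest W-value for its own view of the state):

def timestep(world, agents):
    # Each agent sees only its own projection of the full state x.
    views = [ag.observe(world) for ag in agents]          # hypothetical observe()
    # The agent with the highest W-value for its view wins control.
    k = max(range(len(agents)), key=lambda i: agents[i].W[views[i]])
    a = agents[k].suggest(views[k])                       # winner's suggested action
    world.execute(a)                                      # hypothetical execute()
    return k, a   # enough to print the movie caption:  x [Ak] a -> y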





Af

First we watch the creature completely under the control of agent Af. Af senses the direction of visible food within a small radius, and also whether or not it is carrying food. It does not sense the nest or the predator. It senses x = (i,f), where i is whether the creature is carrying food and f is the direction of visible food (or a "not visible" value).

It has reward function:

agent Af generates rewards for itself:
 if (just picked up food) reward = 0.7
 else reward = 0
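
In code, Af's view and reward might look something like this (choosing 8 as the "not visible" code is my assumption for illustration; the 0.7 reward is from the rule above):

NOT_VISIBLE = 8   # assumed encoding: directions are 0-7, so 8 = "nothing visible"

def af_observe(carrying_food, food_direction):
    """Af's view x = (i, f): a carrying flag plus the food direction only."""
    f = food_direction if food_direction is not None else NOT_VISIBLE
    return (int(carrying_food), f)

def af_reward(just_picked_up_food):
    return 0.7 if just_picked_up_food else 0.0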

After Q-learning it behaves as follows:



Af movie, 100 steps.


By Q-learning, Af builds up its Q-values. These values mean that it learns to seek out food when the creature is not carrying any, but once it picks food up it is at a loss what to do. The only way it can gain any future reward is to lose the piece of food at the nest, but it cannot learn how to do this because it does not sense the nest. So it just wanders about. If it should accidentally wander into the nest and lose its food, it immediately sets off in search of more, and once successful, will be aimless again. And so on. It completely ignores the predator.
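
Questions like the two below can be checked directly against the Q-table, by asking which action is greedy in each state; a small sketch:

def greedy_action(Q, x, n_actions=9):
    """The action actually taken in state x (the argmax over Q-values)."""
    return max(range(n_actions), key=lambda a: Q[(x, a)])

# Wandering is random only if the Q-values for a state are (near) equal,
# so that no single action dominates - e.g. look at the state
# (not carrying, no food visible) for the first question, and the
# (carrying, ...) states for the second.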


Q. Examine the Q-values. When searching for food, and no food is in sight, is Af's wandering random?

Q. Examine the Q-values. When carrying food, is Af's wandering random?



An

Next we watch the creature under the control of agent An. An senses only the direction of the nest. It senses x = (n), where n is the direction of the nest (or a "not visible" value).

It has reward function:

agent An generates rewards for itself:
  if (just arrived at nest) reward = 0.1
  else reward = 0

After Q-learning it behaves as follows:



An movie, 100 steps.


By Q-learning, An builds up its Q-values. If the nest is not visible, An wanders randomly. Once it is visible, An heads straight to it and then, instead of staying put, learns to jump out and back in so that it can collect the "just arrived at nest" reward again and again! It is happy maximising its reward, ignoring both the food and the predator.

Q. Examine the Q-values. When jumping out from the nest and back in, is there any pattern to An's jumps?



Ap

Then we watch the creature under the control of agent Ap. Ap senses only the direction of the predator. It senses x = (p), where p is the direction of the predator (or a "not visible" value).

It has reward function:

agent Ap generates rewards for itself:
  if (just shook off predator (no longer visible)) reward = 0.5
  else reward = 0

After Q-learning it behaves as follows:



Ap movie, 100 steps.


By Q-learning, Ap builds up its Q-values. If the predator is visible, Ap learns to move away from it. When the predator has gone out of sight, Ap doesn't actually stay put, but seems to wander randomly in the hope that the predator comes back into sight, so that it can earn the "just shook off predator" reward again! It almost looks as if it is baiting the predator - repeatedly coming near it and then withdrawing. In fact this is an illusion: when the predator is out of sight, Ap cannot tell in which direction to move to see it again.

Q. Examine the Q-values. When moving away from the predator, does Ap move in the strict opposite direction?



The 3 brains competing together in the one head

So we have 3 agents, each with rather obsessive ideas about what the creature should do. We put all three into a single creature, and have them compete through W-learning for the right to control it. All three agents are going to end up somewhat frustrated.

The creature senses the entire state x = (i,n,f,p), though none of the 3 agents sees this full state. By W-learning, the competing agents build up W-values that divide control of the statespace between them.
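
A hedged sketch of the W-update (following the idea that an agent's W-value measures what it loses by not being obeyed; only agents that did not win update their W for the state, and the exact form and constants here are illustrative):

def w_update(agent, x, r, y, alpha=0.2, gamma=0.9):
    """Update an unobeyed agent's W-value for its view x of the state.

    Loss = what the agent predicted for its own preferred action,
    minus what it actually experienced under the winner's action.
    """
    predicted = agent.Q[(x, agent.suggest(x))]
    actual = r + gamma * max(agent.Q[(y, b)] for b in range(agent.n_actions))
    agent.W[x] = (1 - alpha) * agent.W[x] + alpha * (predicted - actual)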

The result is a predator-avoiding, food-foraging creature in which, at every timestep, 2 of the agents are not being listened to.

We watch the creature under the control of the 3 competing agents Af, An, Ap:



3 agents movie, 300 steps.


Note in the caption line how control switches from agent to agent. One thing that helps the agents live together successfully is that they are all restless agents. Not one of them ever wants to stay still, no matter what is happening. This makes it easy for another agent to suggest a movement somewhere. We can draw a map of the statespace showing how control is divided up.



Summary

So, to summarise, the agents start out with random Q-values and W-values, hence the creature starts out with random behavior:



3 agents "before" movie.


By Q-learning, rewards are propagated into Q-values, and by W-learning, the differences between Q-values are propagated into W-values, until the creature finally settles down into a steady pattern of behavior:



3 agents "after" movie.









How I made these Movies

My program actually has no user interface at all. During a run, it writes gnuplot data files. After the run, if I want to see what happened, I tell gnuplot to plot the data files one after another, which plays an animated sequence of images.

To bundle this animation into an MPEG file, I get gnuplot to dump each plot into its own pbm file. The pbm files can then be strung together frame-by-frame into an MPEG.
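
A sketch of that pipeline, driven from Python (file names are illustrative; "set terminal pbm" is a standard gnuplot terminal, and the final MPEG-encoding step is left as a comment since the exact encoder varied by system):

import subprocess

def plot_frame(data_file, frame_file):
    """Ask gnuplot to render one data file as one pbm image."""
    script = (
        "set terminal pbm\n"
        f"set output '{frame_file}'\n"
        f"plot '{data_file}' with points\n"
    )
    subprocess.run(["gnuplot"], input=script, text=True, check=True)

for i in range(100):   # one frame per recorded timestep
    plot_frame(f"step{i:03d}.dat", f"frame{i:03d}.pbm")

# The pbm frames are then strung together into an MPEG with an external
# encoder (tools such as mpeg_encode / ppmtompeg were typical of that era).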


These movies are on a "video appendix" deposited with the 1996 version (PhD 20843) of my PhD thesis in the Manuscripts Room of Cambridge University Library. This VHS video tape plays the 4 Movies above in sequence. First, the creature under the control of agent Af alone. Then An alone. Then Ap alone. Then all 3 competing together in the same body.








