School of Computing. Dublin City University.
The creature senses things only within a small radius around it. It senses the direction (but not the distance) of things inside this radius.
The creature takes actions a, with values 0-7 (move in one of the eight surrounding directions) and 8 (stay still).
We consider the creature with 3 "brains" or "agents" in its head. The 3 agents work by Q-learning. They sense different statespaces and operate according to different reward functions.
Each timestep, the creature senses a state x, each agent inside the creature suggests an action, some agent Ak wins the internal competition and has its action a executed, then the creature senses a new state y. The caption line of the movies shows each step:
x [Ak] a -> y
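This sense-suggest-compete-act loop can be sketched in code (a minimal sketch; the class and method names are illustrative, not from the thesis, and ties here are simply broken by list order):

```python
# A minimal sketch of one timestep of the creature.
class Agent:
    def __init__(self, name):
        self.name = name
        self.Q = {}  # Q[(state, action)] -> predicted discounted reward
        self.W = {}  # W[state] -> strength of this agent's claim to act

    def suggest(self, x):
        # The action this agent most prefers in state x (greedy on its Q-values).
        best = max(range(9), key=lambda a: self.Q.get((x, a), 0.0))
        return best, self.W.get(x, 0.0)

def step(x, agents, execute):
    # Each agent suggests an action; the agent with the highest W-value wins.
    suggestions = [(agent,) + agent.suggest(x) for agent in agents]
    winner, a, _ = max(suggestions, key=lambda s: s[2])
    y = execute(x, a)                           # the winning action is carried out
    print(f"{x} [{winner.name}] {a} -> {y}")    # the movie caption line
    return y
```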
Agent Af has reward function:
agent Af generates rewards for itself: if (just picked up food) reward = 0.7 else reward = 0
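This reward rule, together with the standard one-step tabular Q-update it feeds, might be sketched as follows (ALPHA and GAMMA here are illustrative values, not taken from the thesis):

```python
ALPHA, GAMMA = 0.2, 0.9  # illustrative learning rate and discount factor

def reward_Af(just_picked_up_food):
    # Af's reward rule: 0.7 on the step food is picked up, else 0.
    return 0.7 if just_picked_up_food else 0.0

def q_update(Q, x, a, r, y, actions=range(9)):
    # One Q-learning step: move Q(x,a) toward r + gamma * max_b Q(y,b).
    best_next = max(Q.get((y, b), 0.0) for b in actions)
    Q[(x, a)] = Q.get((x, a), 0.0) + ALPHA * (r + GAMMA * best_next - Q.get((x, a), 0.0))
```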
After Q-learning it behaves as follows:
Af movie, 100 steps (WMV).
By Q-learning, Af builds up these Q-values.
These values mean that it learns to seek out food when the creature is not carrying any,
but then it is at a loss what to do.
The only way it can gain any future rewards is to lose the piece of food at the nest,
but it cannot learn how to do this because it does not sense the nest.
So it just wanders about.
If it should accidentally wander into the nest and lose its food,
it immediately sets off in search of more, and once successful, will be aimless again.
And so on. It completely ignores the predator.
Q. Examine the Q-values.
When searching for food, and no food is in sight, is Af's wandering random?
Q. Examine the Q-values. When carrying food, is Af's wandering random?
Agent An has reward function:
agent An generates rewards for itself: if (just arrived at nest) reward = 0.1 else reward = 0
After Q-learning it behaves as follows:
An movie, 100 steps (WMV).
By Q-learning, An builds up these Q-values.
If the nest is not visible, An wanders randomly.
Once it is visible, An heads straight to it and then, instead of staying put,
learns to jump out and back in so it can get that "just arrived at nest" reward again and again!
It is happy maximising its rewards, ignoring both food and predators.
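The "just arrived at nest" reward fires only on the transition into the nest, so sitting still earns nothing, while a two-step out-and-back cycle collects 0.1 every second step. A quick worked comparison of discounted returns (a sketch; the discount factor is illustrative):

```python
GAMMA = 0.9  # illustrative discount factor, not the thesis's

def discounted_return(rewards, gamma=GAMMA):
    # Sum of gamma^t * r_t over the reward sequence.
    return sum(r * gamma**t for t, r in enumerate(rewards))

horizon = 50
stay = [0.0] * horizon                                  # stay put in the nest: never "just arrived" again
hop  = [0.1 if t % 2 else 0.0 for t in range(horizon)]  # jump out on even steps, back in on odd

print(discounted_return(stay) < discounted_return(hop))  # hopping strictly beats staying
```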
Q. Examine the Q-values. When jumping out from the nest and back in, is there any pattern to An's jumps?
Agent Ap has reward function:
agent Ap generates rewards for itself: if (just shook off predator (no longer visible)) reward = 0.5 else reward = 0
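Ap's reward likewise depends on a transition between two successive senses, not on a single state: the predator was visible a moment ago and is not visible now. A sketch (the parameter names are illustrative):

```python
# Ap's reward rule: 0.5 on the step the predator drops out of sight.
def reward_Ap(predator_was_visible, predator_is_visible):
    just_shook_off = predator_was_visible and not predator_is_visible
    return 0.5 if just_shook_off else 0.0
```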
After Q-learning it behaves as follows:
Ap movie, 100 steps (WMV).
By Q-learning, Ap builds up these Q-values.
If the predator is visible, Ap learns to move away from it.
When the predator has gone out of sight, Ap doesn't actually stay put,
but seems to wander randomly in the hope that the predator comes back into sight
so it can get the "just shook off predator" reward again!
It almost looks as if it is baiting the predator - repeatedly coming near it
and then withdrawing.
In fact this is an illusion.
When the predator is out of sight, it cannot tell in which direction to move
to see it again.
Q. Examine the Q-values. When moving away from the predator, does Ap move in the strict opposite direction?
The creature senses the entire state x = (i,n,f,p). None of the 3 agents sees this full state, though. By W-learning, the competing agents build up W-values, which decide which agent wins the competition in each state.
We watch the creature under the control of the 3 competing agents Af, An, Ap:
3 agents movie, 300 steps (WMV).
Note in the caption line how control switches from agent to agent.
One thing that helps the agents live together successfully is that they are all restless agents.
Not one of them ever wants to stay still, no matter what is happening.
This makes it easy for another agent to suggest a movement somewhere.
We can draw
a map of the statespace
showing how control is divided up.
3 agents "before" movie (WMV).
By Q-learning, rewards are propagated into Q-values,
and by W-learning, the differences between Q-values are propagated into W-values,
until the creature finally
settles down into a steady pattern of behavior:
3 agents "after" movie (WMV).
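One W-learning step might be sketched as follows (the constants are illustrative, and the thesis gives the exact rule): when the winning action is executed, a losing agent i moves W_i(x) toward the gap between what its own preferred action a_i promised and what the executed action actually delivered under its reward function r_i:

```python
ALPHA, GAMMA = 0.2, 0.9  # illustrative learning rate and discount factor

def w_update(Q_i, W_i, x, a_i, r_i, y, actions=range(9)):
    """One W-learning step for a losing agent i.

    Q_i: dict (state, action) -> this agent's Q-value
    W_i: dict state -> this agent's W-value
    a_i: the action agent i wanted; r_i: its reward after the winner acted.
    """
    predicted = Q_i.get((x, a_i), 0.0)                                   # what i's own action promised
    received = r_i + GAMMA * max(Q_i.get((y, b), 0.0) for b in actions)  # what it actually got
    W_i[x] = (1 - ALPHA) * W_i.get(x, 0.0) + ALPHA * (predicted - received)
```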
To bundle this animation into an MPEG file, I get gnuplot to dump each plot into its own pbm file. The pbm files can then be strung together frame-by-frame into an MPEG.
These movies are on a "video appendix" deposited with the 1996 version (PhD 20843) of my PhD thesis in the Manuscripts Room of Cambridge University Library. This VHS video tape plays the 4 Movies above in sequence. First, the creature under the control of agent Af alone. Then An alone. Then Ap alone. Then all 3 competing together in the same body.