Q-learning (§4.3) -
Monolithic Q-learner learns Q(x,a) using the
global reward function of §4.3.1.
Q-values are stored in a neural network
(in fact, for convenience, this is broken into one network per action a).
100 trials, each trial interacting with the world 1400 times
and then replaying experiences 30 times.
Policy improves over time using a Boltzmann distribution.
Test over 20000 timesteps (interactions with world)
to yield score according to
global reward function of §4.3.1.
The architecture of the network and the coding of inputs were adjusted to get
the best score, an average of 6.285 points per 100 timesteps.
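As a rough illustration of the learning scheme above (a minimal tabular sketch, not the thesis code: the per-action networks are stood in for by a lookup table, and the action count, learning rate, discount factor and temperature are assumed values), one step of Q-learning with Boltzmann action selection might look like this in Python:

import math
import random

NUM_ACTIONS = 4          # assumed; the world defines its own action set

class QLearner:
    """Tabular stand-in for the per-action Q networks described above."""

    def __init__(self, alpha=0.2, gamma=0.9):
        self.Q = {}              # (state, action) -> Q(x,a)
        self.alpha = alpha       # learning rate (assumed value)
        self.gamma = gamma       # discount factor (assumed value)

    def q(self, x, a):
        return self.Q.get((x, a), 0.0)

    def update(self, x, a, r, y):
        """One-step Q-learning backup of Q(x,a) toward r + gamma * max_b Q(y,b)."""
        target = r + self.gamma * max(self.q(y, b) for b in range(NUM_ACTIONS))
        self.Q[(x, a)] = self.q(x, a) + self.alpha * (target - self.q(x, a))

    def boltzmann_action(self, x, T=0.5):
        """Choose action a with probability proportional to exp(Q(x,a)/T);
        lowering the temperature T over time makes the policy greedier."""
        prefs = [math.exp(self.q(x, a) / T) for a in range(NUM_ACTIONS)]
        return random.choices(range(NUM_ACTIONS), weights=prefs)[0]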
Hand-coded program (§4.3.3) -
A range of strictly-hierarchical programs were designed,
with both deterministic and stochastic policies.
Test over 20000 timesteps (interactions with world)
to yield score according to
global reward function of §4.3.1.
The best scored an average of 8.612 points per 100 timesteps.
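The hand-coded programs themselves are given in §4.3.3; purely as an illustration of the strictly-hierarchical form (priority-ordered rules, with the placeholder conditions and actions below invented for the example), such a program can be sketched as:

import random

def strict_hierarchy(layers):
    """Build a strictly-hierarchical policy from a priority-ordered list of
    (condition, action) pairs: the first layer whose condition holds in the
    current state chooses the action."""
    def policy(x):
        for applies, act in layers:
            if applies(x):
                return act(x)
        raise ValueError("no layer applied; give the hierarchy a default layer")
    return policy

# Hypothetical state fields and actions, purely for illustration.
policy = strict_hierarchy([
    (lambda x: x["danger"],    lambda x: "flee"),            # highest priority
    (lambda x: x["food_near"], lambda x: "approach_food"),
    (lambda x: True,           lambda x: random.choice(["wander_n", "wander_s"])),  # stochastic default
])

print(policy({"danger": False, "food_near": True}))   # -> approach_food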
Hierarchical Q-learning (§4.4) -
5 small agents, with rewards 1 or 0, as in §4.4.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
The switch then learns Q(x,i) using the
global reward function of §4.3.1.
Switch's Q-values are stored in a neural network
(in fact, for convenience, this is broken into one network per action i).
100 trials, each trial interacting with the world 1400 times
and then replaying experiences 30 times.
Policy improves over time using a Boltzmann distribution.
Test over 20000 timesteps (interactions with world)
to yield score according to
global reward function of §4.3.1.
The architecture of the network and the coding of inputs were adjusted to get
the best score, an average of 13.641 points per 100 timesteps.
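A minimal sketch of one step of the switch (assuming a tabular switch_Q in place of the per-agent networks, and assuming agents[i].best_action(x) and world.step(action) interfaces that are not part of the thesis code):

import math
import random

def switch_step(agents, switch_Q, x, world, alpha=0.2, gamma=0.9, T=0.5):
    """One step of hierarchical Q-learning: the switch picks an agent i,
    executes that agent's preferred action, and updates Q(x,i) with the
    global reward.  alpha, gamma and T are assumed values."""
    n = len(agents)
    q = lambda s, i: switch_Q.get((s, i), 0.0)

    # Boltzmann choice over *agents* rather than over primitive actions.
    prefs = [math.exp(q(x, i) / T) for i in range(n)]
    i = random.choices(range(n), weights=prefs)[0]

    # The chosen agent's preferred action is executed in the world.
    y, global_r = world.step(agents[i].best_action(x))

    # The switch learns from the global reward function of §4.3.1.
    target = global_r + gamma * max(q(y, j) for j in range(n))
    switch_Q[(x, i)] = q(x, i) + alpha * (target - q(x, i))
    return y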
W-learning with subspaces (§8) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then re-learn W-values by W-learning (without reference to global reward) over 50000 timesteps (interactions with world),
then test.
Test resultant creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 13.446 points per 100 timesteps.
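A sketch of how one genotype is evaluated under the scheme above (the relearn_W and test_run callables are assumed stand-ins for the W-learning and test phases, not the thesis code):

def evaluate_genotype(base_Q, genotype, relearn_W, test_run):
    """Fitness of one genotype, i.e. one vector of reward sizes r_i.

    base_Q   : per-agent Q-tables learned once with all r_i = 1 (unit rewards).
    genotype : list of r_i values, one per agent.
    """
    # Q is linear in the reward, so learning with reward 1 and then
    # multiplying by r_i is equivalent to learning with reward r_i (§8.1.2).
    scaled_Q = [{key: r_i * q for key, q in Qi.items()}
                for Qi, r_i in zip(base_Q, genotype)]
    W = relearn_W(scaled_Q)        # 50000 timesteps of W-learning
    return test_run(scaled_Q, W)   # 20000-timestep test -> global score = fitness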
W=Q (§9) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then test.
No W-values to learn since W is simply Q.
Test creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 15.313 points per 100 timesteps.
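Under W=Q no separate W-values exist: each agent's W-value in state x is simply its own (scaled) Q-value for its preferred action, and the highest bidder is obeyed. A minimal sketch, assuming the Q-tables-plus-action-set data layout used in the sketches above:

def w_equals_q_action(scaled_Q, x, actions):
    """Resolve the competition at state x under W=Q.

    scaled_Q : per-agent Q-tables, already multiplied by each agent's r_i.
    actions  : the set of primitive actions.
    Returns (index of winning agent, action to execute).
    """
    best_agent, best_W, chosen = None, float("-inf"), None
    for i, Qi in enumerate(scaled_Q):
        a_i = max(actions, key=lambda a: Qi.get((x, a), 0.0))  # agent i's preference
        W_i = Qi.get((x, a_i), 0.0)                            # W is just Q
        if W_i > best_W:
            best_agent, best_W, chosen = i, W_i, a_i
    return best_agent, chosen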
W-learning with full space (§10) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then re-learn W-values (without reference to global reward)
then test.
Each agent's W-values are stored in a neural network
(one network for each agent).
To learn W-values, do one run of 30000 timesteps (interactions with world)
with random winners.
Each agent then replays its experiences 10 times to learn its W-values.
Test resultant creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 14.871 points per 100 timesteps.
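For reference, the W-learning update applied during the 30000-timestep run can be sketched as follows (a tabular stand-in for the per-agent W network; alpha and gamma are assumed values). It is applied, after the winner's action has been executed, to every agent except that step's winner:

def w_learning_update(Qi, Wi, x, a_i, r_i, y, actions, alpha=0.2, gamma=0.9):
    """Update agent i's W-value for state x after it was not obeyed.

    a_i : the action agent i itself wanted to take in x.
    r_i : the local reward agent i received on this step.
    y   : the state that resulted from the winner's action.
    W_i(x) moves toward the loss agent i just suffered:
        Q_i(x, a_i) - (r_i + gamma * max_b Q_i(y, b))
    """
    suffered = Qi.get((x, a_i), 0.0) - (
        r_i + gamma * max(Qi.get((y, b), 0.0) for b in actions))
    Wi[x] = (1 - alpha) * Wi.get(x, 0.0) + alpha * suffered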
Negotiated W-learning (§11) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then test.
No W-values to learn since competition is resolved on the fly by Negotiated W-learning.
Test creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 18.212 points per 100 timesteps.
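A sketch of the on-the-fly resolution at a single state x (same assumed data layout as the sketches above; this is an illustration of the negotiation loop, not the thesis code):

def negotiated_w(scaled_Q, x, actions):
    """Negotiated W-learning competition at state x.

    scaled_Q : per-agent Q-tables, already multiplied by each agent's r_i.
    Returns (index of winning agent, action to execute).  The loop terminates
    because the leader's W strictly increases on every change of leader and
    there are only finitely many agents.
    """
    q = lambda i, a: scaled_Q[i].get((x, a), 0.0)
    n = len(scaled_Q)
    prefer = [max(actions, key=lambda a: q(i, a)) for i in range(n)]

    leader, W_leader = 0, 0.0          # start with an arbitrary leader
    while True:
        # Loss each other agent would suffer if the leader's action were executed.
        loss, challenger = max((q(i, prefer[i]) - q(i, prefer[leader]), i)
                               for i in range(n) if i != leader)
        if loss > W_leader:
            leader, W_leader = challenger, loss   # the bigger loser takes the lead
        else:
            return leader, prefer[leader]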