Q-learning (§4.3) -
Monolithic Q-learner learns Q(x,a) using the
global reward function of §4.3.1.
Q-values are stored in a neural network
(in fact, for convenience, this is broken into one network per action a).
100 trials, each trial interacting with the world 1400 times
and then replaying experiences 30 times.
Policy improves over time using a Boltzmann distribution.
Test over 20000 timesteps (interactions with world)
to yield score according to
global reward function of §4.3.1.
The architecture of the network and the coding of inputs were adjusted to get
the best score, an average of 6.285 points per 100 timesteps.
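As a rough illustration of the learning scheme above (a minimal tabular sketch, not the thesis code: the per-action networks are stood in for by a lookup table, and the action count, learning rate, discount factor and temperature are assumed values), one step of Q-learning with Boltzmann action selection might look like this in Python:

import math
import random

NUM_ACTIONS = 4          # assumed; the world defines its own action set

class QLearner:
    """Tabular stand-in for the per-action Q networks described above."""

    def __init__(self, alpha=0.2, gamma=0.9):
        self.Q = {}              # (state, action) -> Q(x,a)
        self.alpha = alpha       # learning rate (assumed value)
        self.gamma = gamma       # discount factor (assumed value)

    def q(self, x, a):
        return self.Q.get((x, a), 0.0)

    def update(self, x, a, r, y):
        """One-step Q-learning backup of Q(x,a) toward r + gamma * max_b Q(y,b)."""
        target = r + self.gamma * max(self.q(y, b) for b in range(NUM_ACTIONS))
        self.Q[(x, a)] = self.q(x, a) + self.alpha * (target - self.q(x, a))

    def boltzmann_action(self, x, T=0.5):
        """Choose action a with probability proportional to exp(Q(x,a)/T);
        lowering the temperature T over time makes the policy greedier."""
        prefs = [math.exp(self.q(x, a) / T) for a in range(NUM_ACTIONS)]
        return random.choices(range(NUM_ACTIONS), weights=prefs)[0]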
Hand-coded program (§4.3.3) -
A range of strictly-hierarchical programs were designed,
with both deterministic and stochastic policies.
Test over 20000 timesteps (interactions with world)
to yield score according to
global reward function of §4.3.1.
The best scored an average of 8.612 points per 100 timesteps.
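The hand-coded programs themselves are given in §4.3.3; purely as an illustration of the strictly-hierarchical form (priority-ordered rules, with the placeholder conditions and actions below invented for the example), such a program can be sketched as:

import random

def strict_hierarchy(layers):
    """Build a strictly-hierarchical policy from a priority-ordered list of
    (condition, action) pairs: the first layer whose condition holds in the
    current state chooses the action."""
    def policy(x):
        for applies, act in layers:
            if applies(x):
                return act(x)
        raise ValueError("no layer applied; give the hierarchy a default layer")
    return policy

# Hypothetical state fields and actions, purely for illustration.
policy = strict_hierarchy([
    (lambda x: x["danger"],    lambda x: "flee"),            # highest priority
    (lambda x: x["food_near"], lambda x: "approach_food"),
    (lambda x: True,           lambda x: random.choice(["wander_n", "wander_s"])),  # stochastic default
])

print(policy({"danger": False, "food_near": True}))   # -> approach_food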
Hierarchical Q-learning (§4.4) -
5 small agents, with rewards 1 or 0, as in §4.4.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
The switch then learns Q(x,i) using the
global reward function of §4.3.1.
Switch's Q-values are stored in a neural network
(in fact, for convenience, this is broken into one network per action i).
100 trials, each trial interacting with the world 1400 times
and then replaying experiences 30 times.
Policy improves over time using a Boltzmann distribution.
Test over 20000 timesteps (interactions with world)
to yield score according to
global reward function of §4.3.1.
The architecture of the network and the coding of inputs were adjusted to get
the best score, an average of 13.641 points per 100 timesteps.
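A minimal sketch of one step of the switch (assuming a tabular switch_Q in place of the per-agent networks, and assuming agents[i].best_action(x) and world.step(action) interfaces that are not part of the thesis code):

import math
import random

def switch_step(agents, switch_Q, x, world, alpha=0.2, gamma=0.9, T=0.5):
    """One step of hierarchical Q-learning: the switch picks an agent i,
    executes that agent's preferred action, and updates Q(x,i) with the
    global reward.  alpha, gamma and T are assumed values."""
    n = len(agents)
    q = lambda s, i: switch_Q.get((s, i), 0.0)

    # Boltzmann choice over *agents* rather than over primitive actions.
    prefs = [math.exp(q(x, i) / T) for i in range(n)]
    i = random.choices(range(n), weights=prefs)[0]

    # The chosen agent's preferred action is executed in the world.
    y, global_r = world.step(agents[i].best_action(x))

    # The switch learns from the global reward function of §4.3.1.
    target = global_r + gamma * max(q(y, j) for j in range(n))
    switch_Q[(x, i)] = q(x, i) + alpha * (target - q(x, i))
    return y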
W-learning with subspaces (§8) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then re-learn W-values by W-learning (without reference to global reward) over 50000 timesteps (interactions with world),
then test.
Test resultant creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 13.446 points per 100 timesteps.
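A sketch of how one genotype is evaluated under the scheme above (the relearn_W and test_run callables are assumed stand-ins for the W-learning and test phases, not the thesis code):

def evaluate_genotype(base_Q, genotype, relearn_W, test_run):
    """Fitness of one genotype, i.e. one vector of reward sizes r_i.

    base_Q   : per-agent Q-tables learned once with all r_i = 1 (unit rewards).
    genotype : list of r_i values, one per agent.
    """
    # Q is linear in the reward, so learning with reward 1 and then
    # multiplying by r_i is equivalent to learning with reward r_i (§8.1.2).
    scaled_Q = [{key: r_i * q for key, q in Qi.items()}
                for Qi, r_i in zip(base_Q, genotype)]
    W = relearn_W(scaled_Q)        # 50000 timesteps of W-learning
    return test_run(scaled_Q, W)   # 20000-timestep test -> global score = fitness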
W=Q (§9) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then test.
No W-values to learn since W is simply Q.
Test creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 15.313 points per 100 timesteps.
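Under W=Q no separate W-values exist: each agent's W-value in state x is simply its own (scaled) Q-value for its preferred action, and the highest bidder is obeyed. A minimal sketch, assuming the Q-tables-plus-action-set data layout used in the sketches above:

def w_equals_q_action(scaled_Q, x, actions):
    """Resolve the competition at state x under W=Q.

    scaled_Q : per-agent Q-tables, already multiplied by each agent's r_i.
    actions  : the set of primitive actions.
    Returns (index of winning agent, action to execute).
    """
    best_agent, best_W, chosen = None, float("-inf"), None
    for i, Qi in enumerate(scaled_Q):
        a_i = max(actions, key=lambda a: Qi.get((x, a), 0.0))  # agent i's preference
        W_i = Qi.get((x, a_i), 0.0)                            # W is just Q
        if W_i > best_W:
            best_agent, best_W, chosen = i, W_i, a_i
    return best_agent, chosen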
W-learning with full space (§10) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then re-learn W-values (without reference to global reward)
then test.
Each agent's W-values are stored in a neural network
(one network for each agent).
To learn W-values, do one run of 30000 timesteps (interactions with world)
with random winners.
Each agent then replays its experiences 10 times to learn its W-values.
Test resultant creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 14.871 points per 100 timesteps.
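For reference, the W-learning update applied during the 30000-timestep run can be sketched as follows (a tabular stand-in for the per-agent W network; alpha and gamma are assumed values). It is applied, after the winner's action has been executed, to every agent except that step's winner:

def w_learning_update(Qi, Wi, x, a_i, r_i, y, actions, alpha=0.2, gamma=0.9):
    """Update agent i's W-value for state x after it was not obeyed.

    a_i : the action agent i itself wanted to take in x.
    r_i : the local reward agent i received on this step.
    y   : the state that resulted from the winner's action.
    W_i(x) moves toward the loss agent i just suffered:
        Q_i(x, a_i) - (r_i + gamma * max_b Q_i(y, b))
    """
    suffered = Qi.get((x, a_i), 0.0) - (
        r_i + gamma * max(Qi.get((y, b), 0.0) for b in actions))
    Wi[x] = (1 - alpha) * Wi.get(x, 0.0) + alpha * suffered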
Negotiated W-learning (§11) -
8 small agents, with rewards r_i or 0, as in §8.
Agents learn Q-values for their local reward functions by random exploration of 300000 timesteps (interactions with world),
all learning together
with random winner each step.
Agents learn Q-values once with all r_i = 1.
Genetic algorithm genotype is a set of r_i's.
Population size 60.
For each individual genotype, multiply base Q-values by r_i (§8.1.2),
then test.
No W-values to learn since competition is resolved on the fly by Negotiated W-learning.
Test creature over 20000 timesteps (interactions with world)
to yield score according to global reward function of §4.3.1.
This score is the fitness function to decide who is allowed to reproduce.
Evolution for 30 generations found the best combination of r_i's,
scoring an average of 18.212 points per 100 timesteps.
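A sketch of the on-the-fly resolution at a single state x (same assumed data layout as the sketches above; this is an illustration of the negotiation loop, not the thesis code):

def negotiated_w(scaled_Q, x, actions):
    """Negotiated W-learning competition at state x.

    scaled_Q : per-agent Q-tables, already multiplied by each agent's r_i.
    Returns (index of winning agent, action to execute).  The loop terminates
    because the leader's W strictly increases on every change of leader and
    there are only finitely many agents.
    """
    q = lambda i, a: scaled_Q[i].get((x, a), 0.0)
    n = len(scaled_Q)
    prefer = [max(actions, key=lambda a: q(i, a)) for i in range(n)]

    leader, W_leader = 0, 0.0          # start with an arbitrary leader
    while True:
        # Loss each other agent would suffer if the leader's action were executed.
        loss, challenger = max((q(i, prefer[i]) - q(i, prefer[leader]), i)
                               for i in range(n) if i != leader)
        if loss > W_leader:
            leader, W_leader = challenger, loss   # the bigger loser takes the lead
        else:
            return leader, prefer[leader]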