Dr. Mark Humphrys

School of Computing. Dublin City University.




G Experimental Details

The experiments in this dissertation should not be regarded as completely definitive. As noted in §17.1, the complexity of the artificial world needs to increase if we are to properly separate the methods. Comparisons similar to the experiments here could then be run. The full details of those experiments follow; they were all implemented in C++.



Q-learning (§4.3) - Monolithic Q-learner learns Q(x,a) using the global reward function of §4.3.1. Q-values are stored in a neural network (in fact, for convenience, this is broken into one network per action a). 100 trials, each trial interacting with the world 1400 times and then replaying the experiences 30 times. The policy improves over time using a Boltzmann distribution. Test over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. The architecture of the network and the coding of the inputs were adjusted to get the best score, an average of 6.285 points per 100 timesteps.
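
For illustration, a minimal sketch of this arrangement, assuming a simple linear approximator standing in for the actual neural network; the names, the learning rate alpha and the fixed temperature are placeholders rather than the settings used in the experiments:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// Stand-in for the thesis's neural network: a linear approximator, with one
// "network" per action a, matching the one-net-per-action arrangement above.
struct QNet {
    std::vector<double> w;                                   // weights
    double predict(const std::vector<double>& x) const {
        double s = 0.0;
        for (size_t i = 0; i < w.size() && i < x.size(); ++i) s += w[i] * x[i];
        return s;
    }
    void update(const std::vector<double>& x, double target, double alpha) {
        double err = target - predict(x);
        for (size_t i = 0; i < w.size() && i < x.size(); ++i) w[i] += alpha * err * x[i];
    }
};

// Boltzmann (softmax) action selection: higher-valued actions are chosen more
// often, with the temperature controlling how greedy the policy is.
int boltzmannAction(const std::vector<QNet>& q, const std::vector<double>& x,
                    double temperature)
{
    std::vector<double> p(q.size());
    double sum = 0.0;
    for (size_t a = 0; a < q.size(); ++a) {
        p[a] = std::exp(q[a].predict(x) / temperature);
        sum += p[a];
    }
    double r = sum * (std::rand() / (RAND_MAX + 1.0));
    for (size_t a = 0; a < q.size(); ++a)
        if ((r -= p[a]) <= 0.0) return int(a);
    return int(q.size()) - 1;
}

// One-step Q-learning backup for an experience (x, a, r, y), also used when
// replaying stored experiences: target = r + gamma * max_b Q(y,b).
void qUpdate(std::vector<QNet>& q, const std::vector<double>& x, int a,
             double r, const std::vector<double>& y, double gamma, double alpha)
{
    double best = q[0].predict(y);
    for (size_t b = 1; b < q.size(); ++b) best = std::max(best, q[b].predict(y));
    q[a].update(x, r + gamma * best, alpha);
}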


Hand-coded program (§4.3.3) - A range of strictly-hierarchical programs were designed, with both deterministic and stochastic policies. Test over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. The best scored an average of 8.612 points per 100 timesteps.
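
To illustrate what "strictly-hierarchical" means here, a sketch of a deterministic priority-ordered policy of this general kind; the sensor predicates and actions are hypothetical names used only to show the control structure, not those of the actual programs:

// Hypothetical sensor snapshot and action set, for illustration only.
struct Senses {
    bool danger;          // something harmful is near
    bool carryingFood;
    bool atNest;
    bool foodVisible;
};

enum Action { FLEE, DROP_FOOD, GO_TO_NEST, PICK_UP_FOOD, WANDER };

// Strictly hierarchical: the first condition that fires decides the action;
// lower-priority behaviours only get control when everything above is quiet.
// A stochastic variant would replace some branches with random choices.
Action handCodedPolicy(const Senses& s)
{
    if (s.danger)                     return FLEE;
    if (s.carryingFood && s.atNest)   return DROP_FOOD;
    if (s.carryingFood)               return GO_TO_NEST;
    if (s.foodVisible)                return PICK_UP_FOOD;
    return WANDER;
}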


Hierarchical Q-learning (§4.4) - 5 small agents, with rewards 1 or 0, as in §4.4. The agents learn Q-values for their local reward functions by random exploration over 300000 timesteps (interactions with the world), all learning together with a random winner each step. The switch then learns Q(x,i) using the global reward function of §4.3.1. The switch's Q-values are stored in a neural network (in fact, for convenience, this is broken into one network per action i). 100 trials, each trial interacting with the world 1400 times and then replaying the experiences 30 times. The policy improves over time using a Boltzmann distribution. Test over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. The architecture of the network and the coding of the inputs were adjusted to get the best score, an average of 13.641 points per 100 timesteps.
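
A minimal sketch of the switch's learning rule; a lookup table is used here for clarity in place of the per-agent networks, and alpha and gamma are placeholder parameters:

#include <algorithm>
#include <map>
#include <utility>

// Switch values Q(x,i): how good it is (under the GLOBAL reward of §4.3.1)
// to hand control to agent i in state x.
using SwitchQ = std::map<std::pair<int,int>, double>;    // (state, agent) -> value

double maxOverAgents(SwitchQ& q, int y, int nAgents)
{
    double best = q[{y, 0}];
    for (int i = 1; i < nAgents; ++i) best = std::max(best, q[{y, i}]);
    return best;
}

// One-step backup after obeying agent i in state x, receiving global reward r
// and arriving in state y.  The agents' own Q-values are frozen at this stage;
// only the switch is still learning.
void switchUpdate(SwitchQ& q, int x, int i, double r, int y,
                  int nAgents, double alpha, double gamma)
{
    double target = r + gamma * maxOverAgents(q, y, nAgents);
    q[{x, i}] += alpha * (target - q[{x, i}]);
}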


W-learning with subspaces (§8) - 8 small agents, with rewards c_i or 0, as in §8. The agents learn Q-values for their local reward functions by random exploration over 300000 timesteps (interactions with the world), all learning together with a random winner each step. The agents learn Q-values once, with all c_i = 1. The genetic algorithm genotype is a set of c_i's. Population size 60. For each individual genotype, multiply the base Q-values by c_i (§8.1.2), then re-learn W-values by W-learning (without reference to the global reward) over 50000 timesteps (interactions with the world), then test. Test the resultant creature over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. This score is the fitness function that decides who is allowed to reproduce. Evolution over 30 generations found the best combination of c_i's, scoring an average of 13.446 points per 100 timesteps.
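
For reference, a minimal sketch of the W-update applied while re-learning W-values for a given set of c_i's: after the winner's action has been executed, each agent that lost moves its W-value toward the loss it just suffered, i.e. the value of its own preferred action minus what it actually received under its local reward. Table-based Q-values and a fixed learning rate are simplifications for the sketch:

#include <algorithm>
#include <vector>

// Per-agent data: scaled Q-values Q[x][a] (already multiplied by c_i) and
// a W-value W[x] for each state.
struct Agent {
    std::vector<std::vector<double>> Q;
    std::vector<double> W;
    double bestQ(int x) const { return *std::max_element(Q[x].begin(), Q[x].end()); }
    int    bestA(int x) const { return int(std::max_element(Q[x].begin(), Q[x].end()) - Q[x].begin()); }
};

// After the winning agent's action moved the world x -> y, every agent that
// lost the competition updates W_i(x) toward the loss it suffered:
//   loss = Q_i(x, a_i)  -  ( r_i + gamma * max_b Q_i(y, b) )
void wUpdate(std::vector<Agent>& agents, int winner, int x, int y,
             const std::vector<double>& localReward,   // r_i observed by each agent
             double alpha, double gamma)
{
    for (size_t i = 0; i < agents.size(); ++i) {
        if (int(i) == winner) continue;                // the winner suffers no loss
        Agent& A = agents[i];
        double loss = A.Q[x][A.bestA(x)] - (localReward[i] + gamma * A.bestQ(y));
        A.W[x] += alpha * (loss - A.W[x]);
    }
}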


W=Q (§9) - 8 small agents, with rewards c_i or 0, as in §8. The agents learn Q-values for their local reward functions by random exploration over 300000 timesteps (interactions with the world), all learning together with a random winner each step. The agents learn Q-values once, with all c_i = 1. The genetic algorithm genotype is a set of c_i's. Population size 60. For each individual genotype, multiply the base Q-values by c_i (§8.1.2), then test. There are no W-values to learn since W is simply Q. Test the creature over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. This score is the fitness function that decides who is allowed to reproduce. Evolution over 30 generations found the best combination of c_i's, scoring an average of 15.313 points per 100 timesteps.
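
Under W=Q the competition itself needs no learning: each agent bids the (scaled) Q-value of its own preferred action and the highest bid wins. A minimal table-based sketch:

#include <algorithm>
#include <vector>

// Q[i][x][a]: agent i's Q-values in state x, already scaled by its c_i.
// Each agent's bid is W_i(x) = Q_i(x, a_i), the value of its preferred action;
// no W-values are stored or learnt.
int wEqualsQWinner(const std::vector<std::vector<std::vector<double>>>& Q, int x)
{
    int winner = 0;
    double bestBid = *std::max_element(Q[0][x].begin(), Q[0][x].end());
    for (size_t i = 1; i < Q.size(); ++i) {
        double bid = *std::max_element(Q[i][x].begin(), Q[i][x].end());
        if (bid > bestBid) { bestBid = bid; winner = int(i); }
    }
    return winner;   // obey this agent's preferred action in state x
}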


W-learning with full space (§10) - 8 small agents, with rewards c_i or 0, as in §8. The agents learn Q-values for their local reward functions by random exploration over 300000 timesteps (interactions with the world), all learning together with a random winner each step. The agents learn Q-values once, with all c_i = 1. The genetic algorithm genotype is a set of c_i's. Population size 60. For each individual genotype, multiply the base Q-values by c_i (§8.1.2), then re-learn W-values (without reference to the global reward), then test. Each agent's W-values are stored in a neural network (one network per agent). To learn the W-values, do one run of 30000 timesteps (interactions with the world) with random winners. Each agent then replays its experiences 10 times to learn its W-values. Test the resultant creature over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. This score is the fitness function that decides who is allowed to reproduce. Evolution over 30 generations found the best combination of c_i's, scoring an average of 14.871 points per 100 timesteps.
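
The evolutionary outer loop shared by these experiments can be sketched as follows; the helper functions are hypothetical names for the world-specific stages described in the text (their implementations are omitted), and selection and breeding are only indicated:

#include <vector>

using Genotype = std::vector<double>;   // one c_i per agent (8 here)

// World-specific stages described in the text; implementations omitted.
void   scaleBaseQValues(const Genotype& c);        // Q_i <- c_i * base Q_i  (§8.1.2)
void   relearnWValues(int timesteps, int replays); // e.g. 30000 steps, replayed 10 times
double testGlobalScore(int timesteps);             // 20000-step test, global reward of §4.3.1

// Fitness of one genotype: scale the base Q-values, re-learn W, then test.
double fitness(const Genotype& c)
{
    scaleBaseQValues(c);
    relearnWValues(30000, 10);
    return testGlobalScore(20000);      // points scored; higher is fitter
}

// Outer loop: 30 generations over a population of 60 genotypes.  Selection,
// crossover and mutation are omitted -- this only shows where fitness fits in.
Genotype evolveBest(std::vector<Genotype> population)
{
    Genotype best = population.front();
    double bestScore = fitness(best);
    for (int gen = 0; gen < 30; ++gen) {
        for (const Genotype& g : population) {
            double s = fitness(g);
            if (s > bestScore) { bestScore = s; best = g; }
        }
        // ... breed the next population from the fitter genotypes ...
    }
    return best;
}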


Negotiated W-learning (§11) - 8 small agents, with rewards c_i or 0, as in §8. The agents learn Q-values for their local reward functions by random exploration over 300000 timesteps (interactions with the world), all learning together with a random winner each step. The agents learn Q-values once, with all c_i = 1. The genetic algorithm genotype is a set of c_i's. Population size 60. For each individual genotype, multiply the base Q-values by c_i (§8.1.2), then test. There are no W-values to learn since the competition is resolved on the fly by Negotiated W-learning. Test the creature over 20000 timesteps (interactions with the world) to yield a score according to the global reward function of §4.3.1. This score is the fitness function that decides who is allowed to reproduce. Evolution over 30 generations found the best combination of c_i's, scoring an average of 18.212 points per 100 timesteps.
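
Finally, a minimal sketch of how the competition is resolved on the fly: starting from an arbitrary leader, each other agent states how much it would lose if the leader's preferred action were executed, the biggest potential loser takes over the lead, and the process repeats until no agent's loss exceeds the current leader's. Table-based Q-values are assumed, and tie-breaking and the choice of starting agent may differ from the actual implementation:

#include <algorithm>
#include <vector>

// Q[i][x][a]: agent i's Q-values for state x, already scaled by its c_i.
// Returns the agent whose preferred action should be executed, resolved on
// the fly without any stored W-values.
int negotiatedWinner(const std::vector<std::vector<std::vector<double>>>& Q, int x)
{
    const int n = int(Q.size());
    auto bestA = [&](int i) {
        return int(std::max_element(Q[i][x].begin(), Q[i][x].end()) - Q[i][x].begin());
    };

    int leader = 0;                 // start from an arbitrary leader
    double leaderW = 0.0;           // the leader suffers no loss from its own action
    bool changed = true;
    while (changed) {
        changed = false;
        int aLeader = bestA(leader);
        for (int i = 0; i < n; ++i) {
            if (i == leader) continue;
            // Loss agent i would suffer if the leader's action were executed
            // instead of its own preferred action.
            double loss = Q[i][x][bestA(i)] - Q[i][x][aLeader];
            if (loss > leaderW) {   // a bigger loser takes over the lead
                leader = i;
                leaderW = loss;
                changed = true;
                break;
            }
        }
    }
    return leader;
}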





