Dr. Mark Humphrys

School of Computing. Dublin City University.




10 W-learning with full space

A further reason why W-learning underperformed is that we still haven't found the ideal version of W-learning. Remember that using only subspaces for $W_i(x)$ results in a loss of accuracy. Using the full space for $W_i(x)$ would result in a more sophisticated competition.

Consider the competition between the dirt-seeker $A_d$ and the smoke-seeker $A_f$. For simplicity, let the global state be x = (d,f). $A_d$ sees only states (d), and $A_f$ sees only (f). When the full state is x = (d,5), for any d, $A_f$ simply sees all these as state (5), that is, smoke is in direction 5. Sometimes $A_d$ opposes it, and sometimes, for no apparent reason, it doesn't. But $W_f((5))$ averages all these together into one variable. It is a crude form of competition, since $A_f$ must present the same W-value in many different situations where its competition will want to do quite different things. The agents might be better able to exploit their opportunities if they could tell the real states apart and present different W-values in each one.
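The averaging effect can be illustrated with a small sketch. The numbers below are invented for illustration, and the subspace value is assumed to be a plain unweighted mean over the full states that share the same f; in practice the weighting would depend on how often each full state is visited.

```python
from collections import defaultdict

def subspace_w(full_space_w):
    """Collapse W-values indexed by full state (d, f) onto the subspace (f).

    Assumption: every full state is visited equally often, so the subspace
    value is the unweighted mean over all d that share the same f.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (d, f), w in full_space_w.items():
        totals[f] += w
        counts[f] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Hypothetical W-values for the smoke-seeker in four full states with f = 5.
# In two of them the dirt-seeker does not oppose it at all (value 0.0).
full_w = {(0, 5): 0.0, (1, 5): 0.0, (2, 5): 0.9, (3, 5): 0.1}
sub_w = subspace_w(full_w)

print(sub_w[5])               # the single value presented in all four states
print(max(full_w.values()))   # what it could present in state (2, 5) alone
```

With only the subspace, the smoke-seeker bids the same middling value whether or not it is actually being opposed. With the full state it could bid high only in (2, 5), where it has the most to lose.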

If we are to make the x in $W_i(x)$ refer to the full state, then each agent needs a single neural network to implement the function. The agent's neural network takes a vector input x and produces a floating point output $W_i(x)$. The Q-values can remain as subspaces of course. We are back basically to the same memory requirements as Hierarchical Q-learning - subspaces for the Q-values and then n times the full state x.



10.1 Strict highest W

Recall that if the winner is to be the strict highest W, we start with W random negative, and leave the leading $W_k(x)$ unchanged, waiting to be overtaken. This works for lookup tables, but will not work with neural networks. First, trying to initialise W to random negative values is pointless, since the network's outputs will make large jumps up and down in the early stages when its weights are untuned. Second, even if we do not update it, $W_k(x)$ will still change as the other $W_k(y)$ change. And if the net doesn't see $x$ for a while, it will forget it.
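The second objection can be seen in a toy example. The sketch below uses a three-weight linear approximator as a stand-in for a neural network; the feature vectors, learning rate and values are all hypothetical. It only shows that updating the value at one state moves the value at another state when weights are shared, which cannot happen in a lookup table.

```python
def predict(weights, features):
    """Output of a linear approximator: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))

def sgd_update(weights, features, target, lr=0.1):
    """One gradient step moving predict(weights, features) towards target."""
    error = target - predict(weights, features)
    return [w + lr * error * f for w, f in zip(weights, features)]

# Lookup table: updating state "y" leaves state "x" exactly as it was.
table = {"x": -0.5, "y": -0.2}
table["y"] = 1.0
print(table["x"])                     # still the initial negative value

# Shared weights: states x and y overlap in the second feature (hypothetical).
x_features = [1.0, 1.0, 0.0]
y_features = [0.0, 1.0, 1.0]
weights = [-0.5, 0.0, 0.0]

before = predict(weights, x_features)
weights = sgd_update(weights, y_features, target=1.0)   # x itself is never updated
after = predict(weights, x_features)
print(before, after)                  # the value at x has moved anyway
```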

We could think of various methods to try to repeatedly clamp $W_k(x)$, but it seems all would need extra memory to remember what value it should be clamped to.



10.2 Stochastic highest W

The approach we took instead was: Start with W random. Do one run of 30000 steps with random winners so that everyone experiences what it's like to lose, and remembers these experiences. Then they each replay their experiences 10 times to learn from them properly. Note that when learning W-values in a neural network, we are just doing updates of the form $W_i(x) \mapsto Q_i(x,a_i) - (r_i + \gamma \max_{b} Q_i(y,b))$. No W-value is referenced on the right-hand side, unlike the case of learning the Q-values. Hence there is no need for our concept of backward replay.
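A minimal sketch of this procedure, under several assumptions: a dictionary stands in for each agent's neural network, the function `loss` is a made-up placeholder for whatever loss estimate an agent derives from its Q-values, and the run is shortened to 3000 steps. All names here are hypothetical.

```python
import random

def run_random_winners(n_agents, n_steps, sample_state, loss, rng):
    """One run where the winner is chosen uniformly at random at every step.
    Each losing agent stores (state, observed loss) for later replay."""
    memory = {i: [] for i in range(n_agents)}
    for _ in range(n_steps):
        x = sample_state(rng)
        k = rng.randrange(n_agents)              # random winner
        for i in range(n_agents):
            if i != k:
                memory[i].append((x, loss(i, k, x)))
    return memory

def replay(experiences, w, passes=10, lr=0.05):
    """Replay stored experiences several times. The target contains no W term,
    so the order of replay does not matter and forward order is used."""
    for _ in range(passes):
        for x, target in experiences:
            w[x] += lr * (target - w[x])
    return w

rng = random.Random(0)
n_agents, n_states, n_steps = 3, 4, 3000

def sample_state(rng):
    return rng.randrange(n_states)

def loss(i, k, x):
    # Hypothetical loss suffered by agent i when agent k wins in state x.
    return 0.1 * abs(i - k) * (x + 1)

memory = run_random_winners(n_agents, n_steps, sample_state, loss, rng)
w_tables = {}
for i in range(n_agents):
    w = {x: rng.uniform(-1.0, 1.0) for x in range(n_states)}  # start with W random
    w_tables[i] = replay(memory[i], w)
```

Because each target depends only on the stored experience, the random initial W-values are simply overwritten as replay proceeds.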

With a similar neural network architecture as before, the best combination of agents found, scoring 14.871, was:

singlespace1736

which is better than W-learning with subspaces, but still not as good as W=Q. A problem with this method of random winners is that it will actually build up each $W_i(x)$ to be the average loss over all other agents in the lead:

displaymath8367

for tex2html_wrap_inline8391 . So what we are doing is in fact finding:

displaymath8368

This sum doesn't really mean anything. For example, it is certainly not the loss that the current leader is causing for the agent.
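The averaging can be checked numerically. In the sketch below the three leaders "A", "B", "C" and the losses they cause are invented, and a simple running average stands in for the network's learning rule. It suggests that under uniformly random winners the learned value settles near the mean loss, whichever agent is actually in the lead.

```python
import random
from statistics import mean

# Hypothetical loss caused to one agent, in one fixed state x,
# by each of the other agents when that agent is the leader.
loss_if_leader = {"A": 0.0, "B": 0.3, "C": 0.9}
leaders = sorted(loss_if_leader)

rng = random.Random(1)
w, n = 0.0, 0
for _ in range(30000):
    k = rng.choice(leaders)                   # random winner among the other agents
    n += 1
    w += (loss_if_leader[k] - w) / n          # running average of observed losses

print(w)                                      # close to the mean loss over leaders
print(mean(loss_if_leader.values()))          # the mean itself
print(loss_if_leader["C"])                    # the loss if C really is the leader
```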

Using random winners is equivalent to a stochastic highest W strategy with fixed high temperature. We would probably have got better results if we had used a more normal stochastic highest W - one with a declining temperature. This would have multiple trials, replay after each trial, and a declining temperature over time as in §4.3.2. But we have some confirmation that telling states apart is a good thing. In the next section, we find out what happens when we can tell states apart perfectly.
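For illustration, a stochastic highest W with declining temperature might look like the sketch below. It assumes a Boltzmann (softmax) distribution over the W-values and a geometric cooling schedule; the section referred to above may define both differently, and the function names and constants are hypothetical.

```python
import math
import random

def boltzmann_winner(w_values, temperature, rng):
    """Pick a winner with probability proportional to exp(W / temperature)."""
    top = max(w_values)                       # subtract the max for numerical safety
    weights = [math.exp((w - top) / temperature) for w in w_values]
    return rng.choices(range(len(w_values)), weights=weights, k=1)[0]

def temperature(trial, n_trials, t_start=10.0, t_end=0.05):
    """Geometric decline from t_start on the first trial to t_end on the last."""
    frac = trial / max(1, n_trials - 1)
    return t_start * (t_end / t_start) ** frac

rng = random.Random(0)
w_values = [0.1, 0.5, 0.2]                    # hypothetical W-values in one state
n_trials = 5
for trial in range(n_trials):
    t = temperature(trial, n_trials)
    wins = [0, 0, 0]
    for _ in range(1000):
        wins[boltzmann_winner(w_values, t, rng)] += 1
    print(f"trial {trial}: T={t:.2f} wins={wins}")
```

At high temperature the choice is close to the random winners used above. As the temperature falls, the agent with the highest W wins more and more often, so the losses each agent experiences come increasingly from the real leader.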


