Dr. Mark Humphrys

School of Computing. Dublin City University.




9 W=Q (Maximize the Best Happiness)

The first response to W-learning is to ask if we need such an elaborate value of W. Why not simply have actions promoted with their Q-values, as we originally suggested back in §5.3? The agent promotes its action with the same strength no matter what its competition (if any):

$$ W_i(x) = Q_i(x, a_i) $$

and we search for an adaptive combination of $r_i$'s as before. To test a particular combination of $r_i$'s, we just multiply the base Q-values by them and then see how the creature performs under the rule W=Q. There are no W-values to learn.

If the agents share the same suite of actions, W=Q is equivalent to simply finding the action:

$$ \arg\max_{a} \max_{i} Q_i(x, a) $$

since agents suggest their best Q over a and we take the highest W=Q over i. That is, we are only interested in the best possible individual happiness. We are going to start drawing economic analogies to our various approaches. In economic theory, this would be the equivalent of a Nietzschean social welfare function [Varian, 1993, §30], where the value of an allocation depends on the welfare of the best off agent.
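The equivalence can be illustrated with a minimal Python sketch. It assumes the Q-values for the current state are held in a nested list `q[i][a]` (agent i, action a, both zero-indexed). The function names and the sample numbers are illustrative only and are not taken from the House Robot experiments.

```python
from typing import Sequence


def w_equals_q_by_agents(q: Sequence[Sequence[float]]) -> int:
    """Each agent promotes its best action with W = its best Q; the highest W wins."""
    winner = max(range(len(q)), key=lambda i: max(q[i]))
    q_w = q[winner]
    # The winning agent's preferred action is executed.
    return max(range(len(q_w)), key=lambda a: q_w[a])


def w_equals_q_by_actions(q: Sequence[Sequence[float]]) -> int:
    """Equivalent form: the action with the highest Q-value held by any agent."""
    n_actions = len(q[0])
    return max(range(n_actions), key=lambda a: max(q_i[a] for q_i in q))


if __name__ == "__main__":
    # Hypothetical Q-values: 3 agents sharing the same 3 actions.
    q = [[0.2, 0.7, 0.1],
         [0.5, 0.3, 0.6],
         [0.4, 0.4, 0.9]]
    print(w_equals_q_by_agents(q), w_equals_q_by_actions(q))
```

Barring ties, both functions return the same action, which is why no W-values need to be stored under this rule.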

The counterpart of this method would be:

$$ \arg\min_{a} \min_{i} \left( Q_i(x, a_i) - Q_i(x, a) \right) $$

that is, find the action which leads to the smallest unhappiness for someone and take it. This approach is pointless: obeying any one agent causes zero unhappiness for that agent, so the minimum is always zero and the rule reduces to just obeying one of the agents.
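A small sketch makes the point concrete. It assumes the Q-values for one state are stored as a nested list `q[i][a]` (agent i, action a, zero-indexed); the helper names and numbers are hypothetical.

```python
def preferred_action(q_i):
    """Index of the action with the highest Q-value for one agent."""
    return max(range(len(q_i)), key=lambda a: q_i[a])


def smallest_unhappiness(q, a):
    """min over agents of (best Q for that agent - Q for action a)."""
    return min(max(q_i) - q_i[a] for q_i in q)


# Hypothetical Q-values: 2 agents, 3 shared actions.
q = [[0.2, 0.7, 0.1],
     [0.5, 0.3, 0.6]]

for i, q_i in enumerate(q):
    a_i = preferred_action(q_i)
    # Obeying agent i gives that agent a loss of exactly zero.
    print(i, a_i, smallest_unhappiness(q, a_i))
```

Every agent's preferred action scores zero under this measure, so the measure cannot discriminate between the agents.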

I have not seen an example of straightforward use of W=Q in Reinforcement Learning, but it can hardly be an original idea. What look like examples [Rummery and Niranjan, 1994] turn out only to be using multiple neural networks for storing Q-values $Q(x,a)$ in a monolithic (single reward function) Q-learning system and then letting through the action with the highest Q-value.

Searching for combinations of $r_i$'s under W=Q works very well, and finds the following collection, which achieves a score of 15.313. Further, the memory requirements are even lower, since no W-values at all are kept.

[Table: the collection found by the search under W=Q.]



9.1 Discussion

So have we wasted our time with measures of W that make compromises with the competition? Would we have been better off ignoring the competition completely?

It seems on paper that W=Q should not perform so well, since it maximizes the rewards of only one agent, while W-learning makes some attempt to maximize their collective rewards (which is roughly what the global reward is). Consider the following scenario, where there are two possible actions (1) and (2). The agents' preferred actions are highlighted:

    action         (1)     (2)
    Q_1(x,a)       0       0.9*
    Q_2(x,a)       1.1*    1

    (* marks each agent's preferred action)

If we use W=Q, then agent $A_2$ wins (since 1.1 > 0.9), so action (1) is executed, $A_2$ gets reward 1.1, and $A_1$ gets 0. If we use the W = (D-f) method, then $A_1$ wins (since it would suffer 0.9 if it didn't, while $A_2$ would only suffer 0.1 if disobeyed), so action (2) is executed, $A_2$ gets 1, and $A_1$ gets 0.9. If the global reward / fitness is roughly a combination of the agents' rewards, then W = (D-f) is a better strategy. In short, this is the familiar ethology problem of opportunism - can $A_1$ force $A_2$ into a small diversion from its plans to pick up along the way a goal of its own?
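The arithmetic of this scenario can be checked with a short sketch. The numbers are the ones quoted in the text; the row order, the function names, and the two-agent shortcut for W = (D-f) (each agent's W is the loss it would suffer if the other agent's preferred action were executed) are simplifying assumptions, not the full W-learning algorithm.

```python
def preferred(q_i):
    """Index of an agent's best action."""
    return max(range(len(q_i)), key=lambda a: q_i[a])


def select_w_equals_q(q):
    """The agent with the highest best-Q wins and its action is executed."""
    winner = max(range(len(q)), key=lambda i: max(q[i]))
    return preferred(q[winner])


def select_d_minus_f(q):
    """Two agents only: W is the loss suffered if the other agent is obeyed."""
    a0, a1 = preferred(q[0]), preferred(q[1])
    w0 = q[0][a0] - q[0][a1]
    w1 = q[1][a1] - q[1][a0]
    return a0 if w0 >= w1 else a1


def combined_value(q, a):
    """Sum of all agents' Q-values for the executed action."""
    return sum(q_i[a] for q_i in q)


# Row 0: the agent preferring action (2); row 1: the agent preferring action (1).
Q_X = [[0.0, 0.9],
       [1.1, 1.0]]

for name, select in (("W=Q", select_w_equals_q), ("W=(D-f)", select_d_minus_f)):
    a = select(Q_X)
    print(f"{name}: action ({a + 1}), combined value {combined_value(Q_X, a):.1f}")
# W=Q: action (1), combined value 1.1
# W=(D-f): action (2), combined value 1.9
```

Under the assumption that global fitness is roughly the sum of the agents' values, the (D-f) choice is clearly ahead in this state.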

Our W=Q search will find one way to solve this - by just finding a high $r_1$ so that the table becomes:

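[Table: the same state x, with the first agent's Q-values scaled up so that its Q-value for action (2) is now the highest in the table.]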

But this is an unsatisfactory solution because it assumes that it is $A_1$ that always needs high Q-values in order for the two agents to behave opportunistically. What if in another state y, the situation is reversed and it is $A_2$ trying to ask $A_1$ for a slight diversion:

[Table: state y, in which the roles of the two agents are reversed.]

Ideally we would take action (2) in both states. But W=Q will be unable to prevent action (1) being taken in at least one of the states. Currently, agent $A_1$ is losing state x and winning state y. We want it to win state x and lose state y. If we increase $r_1$ to make it win state x, we increase all its Q-values across the board and make it even less likely to lose state y.
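The following sketch sweeps the scaling factor of one agent to illustrate why no single value can work. The numbers for state x are those quoted in the text; the numbers for state y are an assumption (a mirror image of state x), and the function names and the sweep range are hypothetical.

```python
def w_equals_q(q):
    """W=Q: the agent with the highest best-Q wins; return its preferred action."""
    winner = max(range(len(q)), key=lambda i: max(q[i]))
    q_w = q[winner]
    return max(range(len(q_w)), key=lambda a: q_w[a])


def scale_agent(q, agent, r):
    """Multiply one agent's base Q-values by r, leaving the others unchanged."""
    return [[r * v for v in row] if i == agent else list(row)
            for i, row in enumerate(q)]


# Row 0 is the agent whose scale is varied.
STATE_X = [[0.0, 0.9], [1.1, 1.0]]   # row 0 asks row 1 for a small diversion
STATE_Y = [[1.1, 1.0], [0.0, 0.9]]   # assumed mirror image: roles reversed


def opportunistic_in_both(r):
    """True if W=Q picks action (2), index 1, in both states for this scale."""
    return (w_equals_q(scale_agent(STATE_X, 0, r)) == 1
            and w_equals_q(scale_agent(STATE_Y, 0, r)) == 1)


working_scales = [k / 100 for k in range(1, 501) if opportunistic_in_both(k / 100)]
print(working_scales)  # [] under these assumed numbers
```

With these numbers the scaled agent wins state x only when its scale exceeds 1.1 / 0.9, and loses state y only when its scale is below 0.9 / 1.1, so the two conditions cannot hold together. Scaling the other agent instead changes nothing, since only the ratio of the two scales matters.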

W=Q will not be able to find the opportunistic solution in cases like this, whereas W-learning will. And cases like this will be typical. Agents that ask for opportunities from other agents will themselves be asked for opportunities at other times.

In fact, any of our static measures of W, such as:

$$ W_i(x) = Q_i(x, a_i) - \min_{a} Q_i(x, a) $$

would fail to be opportunistic in situations where W-learning would be. When there are more than two actions, the other agent might not be taking the worst action for $A_i$, perhaps only the second best.

So, if we agree that W-learning will find opportunism where W=Q (or any static measure) cannot, why did W-learning not perform better? The answer seems to be that the House Robot environment does not contain problems of the nature above. It does contain situations where, in state x, $A_1$ wants to divert $A_2$ slightly, but only situations where $A_1$ itself doesn't mind being diverted - the 0 above becomes a 0.8. This is because all behaviors here are essentially of the form "if some feature is in some direction, then move in some direction" with rewards for arriving at the feature or losing sight of it. So if $Q_2(x,1)$ is similar to $Q_2(x,2)$, it is because actions (1) and (2) are movements in roughly the same direction, in which case $Q_1(x,1)$ and $Q_1(x,2)$ will end up similar.



9.2 Happiness and Unhappiness

Despite its name, Minimize the Worst Unhappiness (W-learning) does not mean we're always avoiding disaster. Expected reward and expected disaster are two sides of the same coin, because if the leader is not obeyed it will be unhappy. Say we have an agent that will gain a high reward if obeyed. If not obeyed, it won't suffer a punishment; nothing interesting happens. But this might as well be a punishment, since the agent has lost the chance of that reward. It will build up a high W-value under any (D-f) scheme.

So it would be mistaken to think that the difference between Minimize the Worst Unhappiness and Maximize the Best Happiness is that one is concerned with "Unhappiness" and the other with "Happiness". As just noted, these are really the same thing. The real difference between the two approaches is that Minimize the Worst Unhappiness consults with other agents while Maximize the Best Happiness does not consult. Minimize the Worst Unhappiness tries out other agents' actions to see how bad they are. An agent in Maximize the Best Happiness only ever considers its best action.


