Dr. Mark Humphrys

School of Computing. Dublin City University.


Appendix C: 2-reward reward functions

Consider an agent A_i of the form:

A_i 	reward: if (good event) r else s

where r > s.

C.1 Policy in Q-learning

Theorem C.1: The agent learns the same policy for any pair of rewards r > s, irrespective of their actual sizes.

Proof: Let us fix r and s and learn the Q-values. In a deterministic world, given a state x, the Q-value for action a will be:

Q(x,a) = c r + d s

for some real numbers c, d. The Q-value for a different action b will be:

Q(x,b) = e r + f s

for some real numbers e, f. Since the agent receives one of the two rewards on every step, the discounted coefficients sum to the same total for both actions. That is, e + f = c + d.

So whichever one of c and e is bigger defines which is the best action (the one which gets the larger amount of the "good" reward r), irrespective of the sizes of r > s. ∎

To be precise, if c > e, then Q(x,a) > Q(x,b)
Proof: Let c + d = e + f = G
Q(x,a) = c r + (G-c) s
Q(x,b) = e r + (G-e) s
Q(x,a) - Q(x,b) = (c - e) r + (-c + e) s
= (c - e) (r - s)
> 0    (since c > e and r > s)

Note that c, d, e and f are not in general integers - it may not be simply a question of the worse action receiving s instead of r a finite number of times. The worse action may also receive r instead of s at some points, and the number of differences may in fact not be finite.

To be precise, noting that (c-e) = (f-d), the difference between the Q-values is:

Q(x,a) - Q(x,b) = (c-e)(r-s)

where the real number (c-e) is constant for the given two actions a and b in state x. (c-e) depends only on the probabilities of events happening, not on the specific values of the rewards r and s that we hand out when they do. Changing the relative sizes of the rewards r > s can only change the magnitude of the difference between the Q-values, not its sign. The ranking of actions stays the same.

For example, an agent with rewards (10,9) and an agent with rewards (10,0) will have different Q-values but will still suggest the same optimal action.
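The (10,9) versus (10,0) example can be checked numerically. The following is a small sketch, using a hypothetical 3-state deterministic world (not from the thesis) and value iteration in place of online Q-learning; the world, states and `good event` rule are all made up for illustration:

```python
import itertools

GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]

def step(x, a):
    """Hypothetical deterministic toy world: action 1 in state 2
    triggers the 'good event'; all actions cycle through the states."""
    good = (x == 2 and a == 1)
    return (x + 1) % 3, good

def q_values(r, s, sweeps=500):
    """Value iteration for the 2-reward agent: r on the good event, s otherwise."""
    Q = {(x, a): 0.0 for x in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for x, a in itertools.product(STATES, ACTIONS):
            y, good = step(x, a)
            Q[(x, a)] = (r if good else s) + GAMMA * max(Q[(y, b)] for b in ACTIONS)
    return Q

def greedy(Q):
    """The policy: best action in each state."""
    return {x: max(ACTIONS, key=lambda a: Q[(x, a)]) for x in STATES}

Qa = q_values(10, 9)
Qb = q_values(10, 0)
assert Qa != Qb                  # different Q-values...
assert greedy(Qa) == greedy(Qb)  # ...but the same policy
```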

In a probabilistic world, we would have, for action a:

E(r_{t+1}) = Σ_y P_xa(y) r(x,y)
= P_xa(y_1) r(x,y_1) + ... + P_xa(y_n) r(x,y_n)
= p r + q s

where p + q = 1, and similarly for action b:

E(r_{t+1}) = p' r + q' s

for some p' + q' = 1.



for some tex2html_wrap_inline9404 as before.
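The collapse of the expected reward into p' r + q' s can be checked with sample numbers. A minimal sketch, with hypothetical transition probabilities (not from the thesis):

```python
r, s = 10.0, 2.0

# Hypothetical successor states y_1..y_3: (P_xa(y), whether the good event occurs)
successors = [(0.5, True), (0.3, False), (0.2, False)]

# E(r_{t+1}) = sum over y of P_xa(y) * r(x,y)
expected = sum(p * (r if good else s) for p, good in successors)

# Collect the probability mass landing on each of the two rewards
p_prime = sum(p for p, good in successors if good)
q_prime = sum(p for p, good in successors if not good)

assert abs(p_prime + q_prime - 1.0) < 1e-12          # p' + q' = 1
assert abs(expected - (p_prime * r + q_prime * s)) < 1e-12  # E = p'r + q's
```

Since there are only two possible reward values, the full sum over successor states always regroups into the two-term form p' r + q' s.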

C.2 Strength in W-learning

Theorem C.2: The W-values the agent presents are proportional to the difference (r - s) between its rewards.

Proof: From the proof of Theorem C.1:

Q(x,a) - Q(x,b) = (c-e)(r-s)

where (c-e) is a constant independent of the particular rewards. ∎

Using our "deviation" definition, for the 2-reward agent in a deterministic world:

W_i(x) = Q_i(x,a_i) - Q_i(x,a_k) = (c-e)(r-s)

The size of the W-value that A_i presents in state x if A_k is the leader is simply proportional to the difference between its rewards. If A_k wants to take the same action as A_i, then W_i(x) = 0 (that is, (c-e) = 0). If the leader switches to some other agent A_l, the constant (c-e) switches to a new constant, still independent of the rewards.

Increasing the difference between its rewards will cause A_i to have the same disagreements with the other agents about what action to take, but higher W-values - that is, an increased ability to compete. So the progress of the W-competition will be different.

For example, an agent with rewards (8,5) will be stronger (will have higher W-values and win more competitions) than an agent with the same logic and rewards (2,0). And an agent with rewards (2,0) will be stronger than one with rewards (10,9). In particular, the strongest possible 2-reward agent is:

A_i 	reward: if (good event) r_max else r_min

where r_max and r_min are the largest and smallest rewards we allow.
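The strength ordering in the example above follows directly from W being proportional to (r - s). A minimal sketch; the factor 0.7 below is an arbitrary stand-in for the (c-e) constant, which agents with the same logic share:

```python
c_minus_e = 0.7  # arbitrary stand-in: depends only on event probabilities

def w_value(r, s):
    """W-value of a 2-reward agent with rewards (r, s): W = (c-e)(r-s)."""
    return c_minus_e * (r - s)

# (8,5) beats (2,0), which beats (10,9): only the gap r - s matters
assert w_value(8, 5) > w_value(2, 0) > w_value(10, 9)
```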

C.3 Normalisation

Any 2-reward agent can be normalised to the form:
A_i 	reward: if (good event) (r-s) else 0

From Theorem C.1, this will have different Q-values but the same Q-learning policy. And from Theorem C.2, it will have identical W-values. You can regard the original agent as an ((r-s), 0) agent which also picks up an automatic bonus of s every step no matter what it does. Its Q-values can be obtained by simply adding the following to each of the Q-values of the ((r-s), 0) agent:

s + γ s + γ² s + ... = s / (1-γ)

We are shifting the same contour up and down the y-axis in Figure 8.1.

The same suggested action and the same W-values mean that for the purposes of W-learning it is the same agent. For example, an agent with rewards (1.5,1.1) is identical in W-learning to one with rewards (0.4,0). The W=Q method would treat them differently.
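The size of that automatic bonus of s every step can be checked numerically. A minimal sketch, with an arbitrary discount factor, event probability and run length (none from the thesis), comparing discounted returns of the (1.5,1.1) agent and the (0.4,0) agent on the same run of events:

```python
import random

GAMMA = 0.9
r, s = 1.5, 1.1

# The constant s every step contributes the geometric series
# s * (1 + γ + γ² + ...) = s / (1 - γ) to every discounted return.
bonus = s / (1 - GAMMA)

# Approximate the infinite series over a long finite run:
approx = sum(GAMMA**t * s for t in range(1000))
assert abs(approx - bonus) < 1e-9

# On any sequence of events, return of the (r,s) agent
# = return of the (r-s, 0) agent + the bonus.
random.seed(0)
good = [random.random() < 0.3 for _ in range(1000)]  # arbitrary event sequence
ret_orig = sum(GAMMA**t * (r if g else s) for t, g in enumerate(good))
ret_norm = sum(GAMMA**t * ((r - s) if g else 0.0) for t, g in enumerate(good))
assert abs(ret_orig - (ret_norm + bonus)) < 1e-9
```

Since Q-values are expectations of such discounted returns, every Q-value of the original agent exceeds the normalised agent's by exactly s/(1-γ).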

C.4 Exaggeration

Say we have a normalised 2-reward agent A_j:

A_j 	reward: if (good event) r else 0

where r > 0. Now consider the exaggerated agent A_j' with all rewards multiplied by a constant c:

A_j' 	reward: if (good event) c r else 0

Then each Q-value of A_j' is c times the corresponding Q-value of A_j.

Proof: We have just multiplied all rewards by c, so all Q-values are multiplied by c. If this is not clear, see the general proof, Theorem D.1. ∎

I should note this only works if c > 0.

A_j' will have the same policy as A_j, but different W-values. We are exaggerating or levelling out the contour in Figure 8.1. In particular, the strongest possible normalised agent is:

A_i 	reward: if (good event) r_max else 0
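That multiplying all rewards by c multiplies every Q-value by c can be seen at the level of discounted returns. A minimal sketch, with an arbitrary sample reward sequence (hypothetical, not from the thesis):

```python
GAMMA = 0.9

def discounted_return(rewards):
    """Sum of γ^t * r_t over a sample run."""
    return sum(GAMMA**t * rew for t, rew in enumerate(rewards))

seq = [2.0, 0.0, 2.0, 2.0, 0.0]       # sample run of a (2,0) agent
scaled = [3.0 * rew for rew in seq]   # the same run seen by the (6,0) agent

# Scaling every reward by c = 3 scales the whole return by 3,
# hence every Q-value (an expectation of such returns) by 3.
assert abs(discounted_return(scaled) - 3.0 * discounted_return(seq)) < 1e-12
```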
