Dr. Mark Humphrys

School of Computing. Dublin City University.




D 3-reward (or more) reward functions

For 3-reward (or more) agents the relative sizes of the rewards do matter for the Q-learning policy. Consider an agent of the form:
  
$A_i$   reward: if (best event) $r_1$ else if (good event) $r_2$ else $r_3$
where $r_1 > r_2 > r_3$.



D.1 Policy in Q-learning

We show by an example that changing one reward in this agent while keeping others fixed can lead to a switch of policy. Imagine that currently actions a and b lead to the following sequences of rewards:

 (x,a) leads to sequence  tex2html_wrap_inline9556  and then  tex2html_wrap_inline9558  forever

(x,b) leads to sequence tex2html_wrap_inline9562 and then tex2html_wrap_inline9558 forever

Currently action b is the best. We certainly lose out on the fifth step, but we make up for it with the payoff received in the first four steps. However, if we start to increase the size of the best-event reward $r_1$, while keeping $r_2$ and $r_3$ the same, we can eventually make action a the most profitable path to follow and cause a switch in policy.
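As a rough numerical illustration, the sketch below assumes one concrete pair of reward sequences consistent with the description above: action a yields the lowest reward for four steps, the best-event reward on the fifth step, and the lowest reward forever after; action b yields the good-event reward for four steps and the lowest reward forever after. These sequences, the discount factor of 0.9, the reward values, and the function names are all assumptions chosen for illustration.

```python
def discounted_return(prefix, tail, gamma):
    """Discounted sum of the rewards in `prefix`, followed by `tail` forever."""
    head = sum(r * gamma**t for t, r in enumerate(prefix))
    # Geometric series for the constant tail starting at step len(prefix).
    return head + tail * gamma**len(prefix) / (1.0 - gamma)


def best_action(r1, r2, r3, gamma=0.9):
    """Return 'a' or 'b', whichever has the larger discounted return."""
    # Hypothetical sequence for (x,a): lowest reward four times, then r1 once.
    q_a = discounted_return([r3, r3, r3, r3, r1], r3, gamma)
    # Hypothetical sequence for (x,b): r2 four times, then the lowest reward.
    q_b = discounted_return([r2, r2, r2, r2, r3], r3, gamma)
    return "a" if q_a > q_b else "b"


# Only r1 changes; r2 and r3 stay fixed.
for r1 in (2.0, 4.0, 6.0, 8.0):
    print(r1, best_action(r1, r2=1.0, r3=0.0))
```

Under these assumed values the preferred action is b for the smaller values of r1 and a for the larger ones, which is the kind of policy switch the text describes.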



D.2 Strength in W-learning

Because increasing the gaps between rewards may switch policy, we cannot say that in general it will increase W-values. In the example above, say the leader $A_k$ was suggesting (and executing) action a all along. By increasing the gaps between our rewards, we suddenly want to take action a ourselves, so $W_i(x) \rightarrow 0$.

Increasing the difference between its rewards may cause $A_i$ to have new disagreements, and maybe new agreements, with the other agents about what action to take, so the progress of the W-competition may be radically different. Once a W-value changes, we have to follow the whole re-organisation to its conclusion.
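The effect on a single W-value can be sketched numerically. The code below assumes that an agent's W-value in state x is the difference between the Q-value of its own preferred action and the Q-value of the action the leader actually executes; this definition is not restated in this appendix, so treat it as an assumption. The reward sequences, the discount factor, and the function names are likewise hypothetical.

```python
GAMMA = 0.9  # assumed discount factor


def q_values(r1, r2, r3):
    """Q-values of actions a and b under hypothetical reward sequences.

    a: r3 four times, then r1 once, then r3 forever (assumed).
    b: r2 four times, then r3 forever (assumed).
    """
    first_four = sum(GAMMA**t for t in range(4))
    tail = r3 * GAMMA**5 / (1.0 - GAMMA)
    q_a = r3 * first_four + r1 * GAMMA**4 + tail
    q_b = r2 * first_four + r3 * GAMMA**4 + tail
    return {"a": q_a, "b": q_b}


def w_value(q, leader_action):
    """Assumed W-value: loss suffered when the leader's action is executed."""
    return max(q.values()) - q[leader_action]


# The leader executes action a throughout; only r1 is increased.
for r1 in (2.0, 4.0, 6.0):
    print(r1, round(w_value(q_values(r1, 1.0, 0.0), leader_action="a"), 4))
```

With these assumed numbers, widening the gap between the rewards shrinks the W-value and eventually drives it to zero, rather than increasing it.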

What we can say is that multiplying all rewards by the same constant (see §D.4 shortly), and hence multiplying all Q-values by the constant, will increase or decrease the size of all W-values without changing the policy.



D.3 Normalisation

Any agent can be normalised by subtracting its lowest reward from all of its rewards, so that the lowest reward becomes 0. The original agent can be viewed as a normalised one which also picks up that lowest reward every timestep no matter what.

The normalised agent will have the same policy and the same W-values.
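A short sketch of why this holds, assuming the usual discounted return with discount factor $\gamma \in [0,1)$ and assuming that W-values depend only on differences between an agent's own Q-values (neither is restated in this appendix). Here $k$ is a hypothetical name for the constant subtracted from every reward, and $Q'$ denotes the Q-values after subtraction. For any fixed policy, and hence for the optimal one:

```latex
\begin{align*}
Q'(x,a) &= E\left[ \sum_{t=0}^{\infty} \gamma^t \, (r_t - k) \right]
         = Q(x,a) - \frac{k}{1-\gamma}, \\
Q'(x,a) - Q'(x,b) &= Q(x,a) - Q(x,b).
\end{align*}
```

Every Q-value shifts by the same amount, so the ranking of actions in each state, and any quantity built from differences of Q-values, is unchanged.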



D.4 Exaggeration

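Theorem: If every reward of an agent is multiplied by the same constant c, then all of its Q-values and W-values are multiplied by c.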

Think of it as changing the "unit of measurement" of the rewards.

Proof: When we take action a in state x, let $p(r)$ be the probability that reward $r$ is given to the agent with the original rewards (and therefore that reward $cr$ is given to the agent with the rewards multiplied by c). Then the second agent's expected reward is simply c times the first agent's expected reward:

$$\sum_r p(r) \, c r = c \sum_r p(r) \, r$$

It follows from the definitions in §2.1 that the second agent's Q-values and W-values are c times those of the first agent. □

Note that this only holds if c > 0.


The agent with the multiplied rewards will have the same policy as the original agent, but larger or smaller W-values (larger if c > 1, smaller if c < 1).
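The scaling claim can be checked numerically on a toy problem. The sketch below uses synchronous Q-value iteration on a known model rather than Q-learning itself, purely to keep the example deterministic; the 3-state, 2-action world, its rewards, the discount factor, and all names are hypothetical.

```python
def q_iteration(rewards, next_state, gamma=0.9, sweeps=500):
    """Iterate Q(x,a) = r(x,a) + gamma * max_b Q(y,b) on a deterministic world."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        # The comprehension reads the old table and builds a new one.
        q = [[rewards[x][a] + gamma * max(q[next_state[x][a]])
              for a in range(n_actions)]
             for x in range(n_states)]
    return q


def greedy_policy(q):
    """Index of the highest-valued action in each state."""
    return [row.index(max(row)) for row in q]


# Hypothetical world with three distinct reward levels (5, 1 and 0).
rewards = [[0.0, 1.0], [5.0, 0.0], [1.0, 0.0]]
next_state = [[1, 2], [0, 2], [2, 0]]

c = 3.0  # the constant multiplier; must be positive
scaled = [[c * r for r in row] for row in rewards]

q1 = q_iteration(rewards, next_state)
q2 = q_iteration(scaled, next_state)

print(greedy_policy(q1) == greedy_policy(q2))
```

Comparing `q2` against `c * q1` entry by entry shows the same factor throughout. Trying a negative c in the same sketch is one way to see why the restriction to c > 0 matters.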


