School of Computing. Dublin City University.
i.e. Just sum all future rewards.
Consider this situation:
or e.g. You are Hitler. It is 1941. You have conquered Europe and are unthreatened.
So you decide to invade the Soviet Union and declare war on the US (neither of whom were at war with you so far).
There is no way back ever to that state before invasion.
You have gone through a one-way door.
If we take action a, the expected long-term reward is:
E_a(R) = 5 + γ·0 + γ^2·0 + γ^3·0 + ...
= 5
If we take action b, it is:
E_b(R) = 0 + γ·100 + γ^2·0 + γ^3·0 + ...
= 100·γ
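The comparison between the two actions can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name discounted_return and the sample values of γ are assumptions chosen for illustration, and the zero-reward loop is cut off after a few steps because every later term is 0.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards (hypothetical helper)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reward sequences from the example above. The loop rewards are all 0,
# so truncating the sequence does not change the sum.
rewards_a = [5, 0, 0, 0]
rewards_b = [0, 100, 0, 0]

# Sample discount factors (an assumption, chosen to straddle 100*gamma = 5).
for gamma in (0.01, 0.5, 0.9):
    e_a = discounted_return(rewards_a, gamma)
    e_b = discounted_return(rewards_b, gamma)
    better = "a" if e_a > e_b else "b"
    print(f"gamma={gamma}: E_a={e_a}, E_b={e_b}, prefer {better}")
```

Since E_a(R) = 5 and E_b(R) = 100·γ, action b wins whenever γ > 0.05, and a very short-sighted agent (γ < 0.05) takes the immediate 5.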
Imagine if the infinite loops scored 1 each time round instead of 0.
E_a(R) = 5 + γ + γ^2 + γ^3 + ...
E_b(R) = 0 + γ·100 + γ^2 + γ^3 + ...
Notes on infinity:
If γ = 1, then even if one infinite loop gives 1 forever and the other gives 100 forever, the agent cannot tell the difference: both sums are infinite.
In fact, all the following are the same:
1 + 1 + 1 + 1 + ...
-1000 -1000 -1000 -1000 -1000 + 1 + 1 + 1 + ...
10 + 10 + 10 + 10 + ...
For any γ < 1 we do not have this problem.
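The difference between γ = 1 and γ < 1 can be seen by looking at partial sums. This is a rough sketch under stated assumptions: the helper name partial_sum, the value γ = 0.9 and the step counts are illustrative choices, not taken from the notes.

```python
def partial_sum(reward, gamma, steps):
    """Discounted sum of a constant reward over the first `steps` time steps."""
    return sum(reward * gamma ** t for t in range(steps))

for steps in (10, 100, 1000):
    # With gamma = 1 both sums just keep growing with the number of steps.
    undiscounted = (partial_sum(1, 1.0, steps), partial_sum(10, 1.0, steps))
    # With gamma = 0.9 they settle near reward / (1 - gamma), i.e. 10 and 100.
    discounted = (partial_sum(1, 0.9, steps), partial_sum(10, 0.9, steps))
    print(steps, undiscounted, discounted)
```

With γ = 1 the partial sums grow without bound for any positive reward, so the limits cannot be compared. With γ < 1 each sum converges to reward / (1 - γ), so a loop paying 10 is clearly worth more than a loop paying 1.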
If the infinite loops give reward r each time round:
E_a(R)
= 5 + γ·r + γ^2·r + γ^3·r + ...
= 5 + γ·r + r·(γ^2 + γ^3 + ...)
E_b(R)
= 0 + γ·100 + γ^2·r + γ^3·r + ...
= 100·γ + r·(γ^2 + γ^3 + ...)
So what is the point at which we always pick a?
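One way to explore this question: the tail r·(γ^2 + γ^3 + ...) appears in both expressions, so it cancels and a is preferred exactly when 5 + γ·r > 100·γ, which for γ > 0 is r > 100 - 5/γ. The sketch below assumes 0 <= γ < 1 and uses the closed form γ^2 / (1 - γ) for the geometric tail. The function names are hypothetical, not from the notes.

```python
def expected_returns(r, gamma):
    """Closed-form E_a(R) and E_b(R) when the loops pay r per step (0 <= gamma < 1)."""
    tail = r * gamma ** 2 / (1 - gamma)  # r * (gamma^2 + gamma^3 + ...)
    e_a = 5 + gamma * r + tail
    e_b = 100 * gamma + tail
    return e_a, e_b

def prefers_a(r, gamma):
    """True when action a has the higher expected return."""
    # The shared tail cancels, so only 5 + gamma*r versus 100*gamma matters.
    return 5 + gamma * r > 100 * gamma

# Sample values (assumptions for illustration only).
for r, gamma in ((0, 0.5), (90, 0.4), (90, 0.9), (95, 0.99)):
    print(r, gamma, expected_returns(r, gamma), prefers_a(r, gamma))
```

Because 100 - 5/γ stays below 95 for every γ < 1, this sketch suggests that once r reaches 95 the agent prefers a for any discount factor below 1.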