The typical sequence $\alpha_t = 1/t$ goes from 1 down to 0,
but note that if the conditions hold, then for any t,
$\sum_{i=t}^{\infty} \alpha_i = \infty$
and
$\sum_{i=t}^{\infty} \alpha_i^2 < \infty$,
so $\alpha$ may start anywhere along the sequence.
That is,
$\alpha$ may take successive values $\frac{1}{t}, \frac{1}{t+1}, \frac{1}{t+2}, \ldots$
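For concreteness, here is a minimal Python sketch (the names, the table layout and gamma are assumptions for illustration) of a tabular Q-learning update in which each state-action pair keeps its own visit count n(x,a) and uses α = 1/n(x,a), so every pair follows the 1, 1/2, 1/3, ... schedule from whenever it is first visited:

  from collections import defaultdict

  Q = defaultdict(float)      # Q[(x, a)] -> current estimate (0 by default)
  visits = defaultdict(int)   # visits[(x, a)] -> number of updates so far
  gamma = 0.9                 # discount factor (assumed value)

  def q_update(x, a, r, y, actions):
      """One tabular Q-learning update for the observed transition (x, a, r, y).

      alpha = 1 / n(x, a): each state-action pair gets its own
      1, 1/2, 1/3, ... learning-rate sequence.
      """
      visits[(x, a)] += 1
      alpha = 1.0 / visits[(x, a)]
      target = r + gamma * max(Q[(y, b)] for b in actions)
      Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target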
e.g. Say the world changes from MDP1 to MDP2 after time t. Just keep going with Q-learning and it will learn the optimal policy for MDP2 (eventually) and will forget what it learnt for MDP1 (eventually). No need to change anything.
Q-learning automatically adapts if the world/problem changes.
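As a rough numerical illustration of this forgetting (an assumed sketch of the underlying averaging process, analysed below, rather than a full MDP): track the mean of a reward stream with the α = 1/t update and let the true mean change partway through. The estimate drifts towards the new mean, but with α = 1/t the old samples are forgotten only slowly:

  import random

  D = 0.0
  for t in range(1, 20001):
      # The "world changes" at t = 10000: mean reward 1.0 before, 5.0 after.
      d = random.gauss(1.0 if t <= 10000 else 5.0, 0.5)
      alpha = 1.0 / t
      D = (1 - alpha) * D + alpha * d

  print(D)  # about 3.0 here (half the samples from each regime); -> 5.0 only as t -> infinity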
Let $d_1, d_2, \ldots, d_t, \ldots$ be samples of a stationary random variable d
with expected value E(d).
Repeat:
$D_t := (1 - \alpha_t) D_{t-1} + \alpha_t d_t$, with $\alpha_t = 1/t$.
Then $D_t \to E(d)$ as $t \to \infty$.
Proof: D's updates go:
$D_1 = d_1$
$D_2 = \frac{1}{2} D_1 + \frac{1}{2} d_2 = \frac{1}{2} (d_1 + d_2)$
$D_3 = \frac{2}{3} D_2 + \frac{1}{3} d_3 = \frac{1}{3} (d_1 + d_2 + d_3)$
$\vdots$
$D_t = \frac{1}{t} (d_1 + \cdots + d_t)$
As $t \to \infty$:
$D_t \to E(d)$ (the sample mean converges to the expected value, by the law of large numbers),
that is,
$\lim_{t \to \infty} D_t = E(d)$.
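A quick numerical check of this (an illustrative sketch; the distribution is an arbitrary choice): the recursive update with α_t = 1/t reproduces the running sample mean exactly, and so approaches E(d):

  import random

  random.seed(0)
  samples = [random.expovariate(1.0) for _ in range(100000)]  # E(d) = 1.0

  D = 0.0
  for t, d in enumerate(samples, start=1):
      alpha = 1.0 / t                    # 1, 1/2, 1/3, ...
      D = (1 - alpha) * D + alpha * d

  print(D)                                            # close to E(d) = 1.0
  print(abs(D - sum(samples) / len(samples)) < 1e-9)  # True: recursion == sample mean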
One way of looking at this
is to consider the starting estimate $D_{t-1}$ as the average of all samples before time t,
samples which are now irrelevant for some reason.
We can consider them as samples $f_1, \ldots, f_{t-1}$ from a different distribution f:
$D_{t-1} = \frac{1}{t-1} (f_1 + \cdots + f_{t-1})$
Hence:
$D_n = \frac{1}{n} (f_1 + \cdots + f_{t-1} + d_t + \cdots + d_n) \to E(d)$ as $n \to \infty$.
Because:
$\frac{1}{n} (f_1 + \cdots + f_{t-1}) \to 0$ (a fixed, finite sum divided by a growing n),
while $\frac{1}{n} (d_t + \cdots + d_n) \to E(d)$.
If we start at
$\alpha = 1/t$ (for some t > 1),
then the initial Q-values bias our Q-values for some time.
And since we only run for finite time in any finite experiment,
the bias may still be there after learning.
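To get a feel for how long that bias lasts (an illustrative calculation; the numbers are assumptions): if α starts at 1/t, the initial value carries the weight of t-1 "virtual" samples, so after n further real samples it still has weight (t-1)/(t-1+n):

  # Weight still carried by the initial value after n further samples,
  # if alpha starts at 1/t (the initial value counts as t-1 virtual samples).
  def initial_weight(t, n):
      return (t - 1) / (t - 1 + n)

  print(initial_weight(1000, 1000))    # ~0.5: the initial value is still half the estimate
  print(initial_weight(1000, 100000))  # ~0.01: bias almost gone, but only after many samples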
Consider being "born" with Q-values already filled in (i.e. in DNA) and then start learning:
           a1      a2
  Q(x,a)   0       0

Good Q-values to be born with:

           a1      a2
  Q(x,a)   -1000   0

- Even if we experiment in childhood with a moderate-temperature Boltzmann control policy, we are still unlikely to try a1.
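A quick check of that last point (an illustrative sketch; the temperature value is an assumption): with Boltzmann action selection, P(a) is proportional to $e^{Q(x,a)/T}$, so with Q(x,a1) = -1000 and Q(x,a2) = 0 the probability of ever trying a1 is vanishingly small even at a moderate temperature:

  import math

  def boltzmann_probs(qs, T):
      """Boltzmann (softmax) action probabilities at temperature T."""
      m = max(qs)                                   # subtract the max for numerical stability
      exps = [math.exp((q - m) / T) for q in qs]
      z = sum(exps)
      return [e / z for e in exps]

  print(boltzmann_probs([-1000, 0], T=10))  # P(a1) = e^{-100}, about 4e-44: effectively never tried
  print(boltzmann_probs([0, 0], T=10))      # [0.5, 0.5]: with no innate bias, both actions are explored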