Dr. Mark Humphrys

School of Computing. Dublin City University.

State-space control

  


Learning from Rewards

Unlike supervised learning (learning from exemplars), we do not tell the learner the correct "class" / action. Instead we give sporadic, indirect feedback (a bit like "this classification was good/bad").

e.g. Move your muscles to play basketball. I can't articulate what instructions to send to your muscles / robot motors, or in what order. But a child could sit there and tell you when you have scored a basket. In fact, even a machine could detect when a basket is scored and automatically reward you.




Robo-Hoops robot basketball competition (autonomous).
2008 video.



Robot playing air hockey by Reinforcement Learning.
2006 video.
See paper, Humanoid Robot Learning and Game Playing Using PC-Based Vision, by Darrin C. Bentivegna et al, in Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Switzerland, 2002.




Clocksin and Moore - Traffic Junction


Translated into the terms we will be using:

  1. Observe state of the world x = (p,s)
    position and speed of car on main road
    p - 21 values
    s - 20 values
    x has 420 possible values

  2. Take action a = (c,n)
    c - which pedal - 2 values (accelerate, brake)
    n - how much (press pedal this hard) - 5 values
    10 possible actions a

  3. Observe the resulting situation: not crossed, crossed, or collision.


Already we see typical things:

  1. Many more states than actions.
  2. Both are multi-dimensional.
  3. The definition of x and a is very much under our control. We could make it more coarse-grained / fine-grained.

If we tried out every possible action in every possible state, that is 420 × 10 = 4200 experiments to carry out.
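
A minimal sketch (in Python, not from the paper; all names are illustrative) of how small this exhaustive search is - just enumerate every state-action pair:

from itertools import product

positions = range(21)                  # p - 21 discrete values
speeds = range(20)                     # s - 20 discrete values
pedals = ("accelerate", "brake")       # c - which pedal - 2 values
pressures = range(1, 6)                # n - how hard to press - 5 values

states = list(product(positions, speeds))     # all x = (p, s)
actions = list(product(pedals, pressures))    # all a = (c, n)

print(len(states))                   # 420 possible states
print(len(actions))                  # 10 possible actions
print(len(states) * len(actions))    # 4200 experiments to try everything once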





Traditional Approach

Build a model of the physics:
Take the distance from p to the junction.
Compute the time for the car to cover that distance at speed s.
Compare it with the time it takes the agent to cross the road.
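
For concreteness, a minimal sketch of this model-based decision (illustrative Python; the names, units, and crossing time are assumptions, and a real controller would need an accurate, calibrated model):

def safe_to_cross(p, s, crossing_time=4.0):
    """Cross only if the car needs longer to reach the junction than we need to cross."""
    if s == 0:
        return True                  # the car is not moving
    time_for_car = p / s             # time for the car to cover distance p at speed s
    return time_for_car > crossing_time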


Problems / Restrictions:

  1. Need a model in the first place. Need a controlled world. e.g. A factory environment.

  2. The model must be accurate.
    e.g. The dynamics of a robot arm.

  3. The world changes / arm friction increases - we have to re-program.
    But the programmer is long gone.




State-Space Approach

Look at consequences of actions.
"Let the world be its own model"
If action a worked, keep it.
If not, explore another action a2.
After many iterations, we learn the correct action patterns to any level of granularity.
And we never had to understand how the world worked!

We learn the mapping:

x, a -> y
initial state, action -> new state

  1. This approach will work whether we cross the road using wings or fins, or view the world through reverse glasses.

  2. Can adjust (re-learn) as world changes.

  3. More plausible that evolution could have worked this way (fill in the "boxes") rather than building physics models.

  4. Another reason to use state-space (or other) learning is simply when the task is tedious to program. Which may mean expensive to program - Programmers aren't free.
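
To make this concrete, here is a minimal sketch (Python, not from the Clocksin and Moore paper) of trial-and-error learning over the discrete state space above. The step() function is a hypothetical stand-in for the real world, treated as a black box: we record its consequences without ever modelling its physics, and its dynamics and reward here are toy assumptions.

import random

def step(x, a):
    """Hypothetical black-box world: returns (new state y, reward r).
    In a real experiment this is the physical world itself, not a model."""
    p, s = x
    c, n = a
    new_p = max(min(p + (n if c == "accelerate" else -n), 20), 0)   # toy dynamics
    r = 1.0 if new_p == 20 else 0.0                                 # toy reward
    return (new_p, s), r

states = [(p, s) for p in range(21) for s in range(20)]
actions = [(c, n) for c in ("accelerate", "brake") for n in range(1, 6)]

transition = {}      # the learned mapping  x, a -> y
best_action = {}     # the learned mapping  x -> a  (keep whichever action worked best)

for x in states:
    best_r = float("-inf")
    for a in random.sample(actions, len(actions)):    # explore actions in this state
        y, r = step(x, a)
        transition[(x, a)] = y                        # "let the world be its own model"
        if r > best_r:
            best_r, best_action[x] = r, a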




Learning adapts to the actual laws of physics and of the body:
Faith, a dog born with no front legs. Learned to walk on two legs.




Can you do exhaustive search?



If you can do an exhaustive search, you don't need RL or any complex learning.


More usual: we only have time to try some actions in some states.



Many mappings that we could learn:


x -> a



x,a -> y

Multiple y's:
e.g. If you are in state x and take action a
50% of the time you will end up in state y1
and 50% of the time you will end up in state y2

e.g. x = (7,5)
a = (1,5)
y1 = (6,5)
y2 = (7,5)



x,a -> quality or fitness


E(r) exists, E(y) doesn't exist

We can add up rewards and average them - to get the "expected reward" E(r) (the average reward you will get over many events).

Whereas adding states is meaningless:
"Expected next state = ½ (y1 + y2)"

In the example above, ½ (y1 + y2) = (6½, 5).
Expected state?
If you take action a, do you ever go to this state?
Does this state even exist?




Clocksin and Moore paper

They mention the following without noting that both of these may be difficult:
  1. Find "neighbouring" states x.
  2. Get "average" of multiple actions a.
They make connections to what animals, children, and adults do in:
  1. Play
  2. Sleep
  3. Dreaming


Writing a program to write a program

The machine writes a program x -> a, but only if we can think of a program that will write this program.

This may require restricting the domain. e.g. Below we will restrict ourselves to writing a stimulus-response program - a well-understood model, where our program will provably write an optimal solution.

Genetic Programming is a program to write any general-purpose program - Too far too fast?
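
As a rough illustration (not code from the module), the "program" that gets written in the stimulus-response case is just a lookup table from state to action. Here q is a hypothetical table of learned action values; how such values are learned is the subject of what follows.

def write_policy(q, states, actions):
    """Write a stimulus-response program: a function mapping each state x to an action a,
    chosen as the highest-valued action in q (a hypothetical table of learned values)."""
    table = {x: max(actions, key=lambda a: q.get((x, a), 0.0)) for x in states}
    return lambda x: table[x]

# Usage (illustrative): policy = write_policy(q, states, actions); a = policy((7, 5))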




RL demos and sample applications




A simple demo of RL.
The "ant" is rewarded for getting to the "nest".
Do you notice something odd in what it learns?
See full explanation.



Robot learning to flip pancakes by RL.
From Petar Kormushev.
2010 video.



Google DeepMind's Deep Q-learning (RL) playing Atari Breakout.
2015 video.



AI Learns to Escape Extreme Maze (using Deep Reinforcement Learning).
From AI Warehouse. 2025 video.


  

Mobile robots using RL

  

Boston Dynamics uses Reinforcement Learning in simulations to train the Spot robot for the real world.
2024 video.



Robot learning to walk in real world (not from simulation).
2022 video.
See paper: DayDreamer: World Models for Physical Robot Learning, by Philipp Wu et al, 2022.
Core issue: "Deep reinforcement learning is a common approach to robot learning but requires a large amount of trial and error to learn, limiting its deployment in the physical world. As a consequence, many advances in robot learning rely on simulators. On the other hand, learning inside of simulators fails to capture the complexity of the real world, is prone to simulator inaccuracies, and the resulting behaviors do not adapt to changes in the world."
Their key is learning a world model.


