Fundamentals

Introduction to the fundamentals of reinforcement learning.




Resources

What is it

  • One of the main categories of Machine Learning techniques; others include Supervised and Unsupervised Learning
  • There is no supervisor, only a reward signal, $R_t$
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • Agent's actions affect the subsequent data it receives
  • All goals can be described by the maximisation of expected cumulative reward (Reward Hypothesis)
  • Sequential Decision Making
    • Goal: select actions to maximise total future reward (see the interaction-loop sketch after this list)
    • Actions may have long term consequences
    • Reward may be delayed
    • It may be better to sacrifice immediate reward to gain more long term reward
    • Examples:
      • Blocking opponent moves (might help winning chances many moves from now)
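
Below is a minimal sketch of this interaction loop in Python: at each step the agent picks an action, the environment returns an observation and a reward, and the agent's objective is the cumulative reward. The `Environment` and `Agent` classes here are made-up illustrations, not part of any particular library.

```python
import random

class Environment:
    """Hypothetical toy environment: reward 1 for action 1, else 0, for 10 steps."""
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        observation = self.t          # here the observation is just the timestep
        done = self.t >= 10
        return observation, reward, done

class Agent:
    """Hypothetical agent that acts at random; a real agent would learn from rewards."""
    def act(self, observation):
        return random.choice([0, 1])

env, agent = Environment(), Agent()
observation, total_reward, done = 0, 0.0, False
while not done:
    action = agent.act(observation)
    observation, reward, done = env.step(action)
    total_reward += reward            # goal: maximise cumulative reward
print("cumulative reward:", total_reward)
```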


Definitions

The History is the sequence of observations, actions, and rewards. The fundamental idea is that both agent and environment use this to determine what happens next.

$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$

The history is often not useful directly because it is too large; instead we often use the State, a summary of the history, to determine what happens next,

$S_t = f(H_t)$

The state is composed of both the agent state $S^a_t$ and the environment state $S^e_t$ (the latter not usually visible to the agent), considered separately.
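
A small sketch of these definitions, assuming a made-up `Step` record and a `state_fn` that keeps only the latest observation as the state summary:

```python
from collections import namedtuple

Step = namedtuple("Step", ["observation", "reward", "action"])

history = []  # H_t: the full sequence of observations, rewards, and actions

def state_fn(history):
    """S_t = f(H_t): summarise the history; here, keep only the latest observation."""
    return history[-1].observation if history else None

# Append steps as the agent interacts with the environment.
history.append(Step(observation=0, reward=0.0, action=1))
history.append(Step(observation=1, reward=1.0, action=0))
print("state:", state_fn(history))  # -> 1
```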

A state $S_t$ is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$, i.e. "The future is independent of the past given the present".
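
As an empirical sketch (on an assumed two-state chain, not from the source), conditioning on extra history should not change the distribution of the next state when the state is Markov:

```python
import random
from collections import defaultdict

# A two-state Markov chain: probability 0.8 of staying in the current state.
def step(s):
    return s if random.random() < 0.8 else 1 - s

# Compare P[S_{t+1}=1 | S_t=1] with P[S_{t+1}=1 | S_{t-1}, S_t=1]:
# if the state is Markov, the extra conditioning on S_{t-1} should not matter.
counts = defaultdict(lambda: [0, 0])  # key -> [times next state was 1, total]
s_prev, s = 0, 0
for _ in range(200_000):
    s_next = step(s)
    if s == 1:
        counts["given S_t=1"][0] += s_next
        counts["given S_t=1"][1] += 1
        counts[f"given S_t-1={s_prev}, S_t=1"][0] += s_next
        counts[f"given S_t-1={s_prev}, S_t=1"][1] += 1
    s_prev, s = s, s_next

for key, (ones, total) in counts.items():
    print(key, round(ones / total, 3))  # all close to 0.8
```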

Full observability: the agent directly observes the environment state, $O_t = S^a_t = S^e_t$. Formally, this is then a Markov Decision Process.

Partial observability: the agent only indirectly observes the environment, e.g. a robot with a camera is not told its position; formally this is a Partially Observable Markov Decision Process. The agent must construct its own state representation $S^a_t$, e.g. the complete history $S^a_t = H_t$, beliefs of the environment state $S^a_t = (\mathbb{P}[S^e_t = s^1], \ldots, \mathbb{P}[S^e_t = s^n])$, or a recurrent neural network $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$.
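
One simple way to construct an agent state under partial observability, assumed here purely for illustration, is to stack the k most recent observations so that recent context stands in for the hidden environment state:

```python
from collections import deque

class ObservationStackState:
    """Agent state S^a_t built from the k most recent observations (a common heuristic)."""
    def __init__(self, k=4):
        self.buffer = deque(maxlen=k)

    def update(self, observation):
        self.buffer.append(observation)
        return self.state()

    def state(self):
        # Pad with None until k observations have been seen.
        pad = [None] * (self.buffer.maxlen - len(self.buffer))
        return tuple(pad + list(self.buffer))

agent_state = ObservationStackState(k=3)
for obs in ["o1", "o2", "o3", "o4"]:
    s = agent_state.update(obs)
print(s)  # -> ('o2', 'o3', 'o4')
```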

The Agent

Components

A reinforcement learning agent may include one or more of the following components:

  • Policy: the agent's behaviour function
  • Value Function: how good each state and/or action is
  • Model: the agent's representation of the environment

A Policy is a map from state to action, e.g. a Deterministic Policy $a = \pi(s)$ or a Stochastic Policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.
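
A sketch of both kinds of policy over a toy state space; the states, actions, and probabilities are invented for illustration:

```python
import random

# Deterministic policy: a = pi(s), here a simple lookup table.
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], a distribution per state.
stochastic_policy = {
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.3, "right": 0.7},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"))  # always "left"
print(act_stochastic("s1"))     # "left" about 90% of the time
```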

A Value Function is a prediction of future reward, used to evaluate the goodness/badness of states and therefore to select between actions, e.g.

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$
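
A small sketch of the quantity inside that expectation, the discounted return of one sampled reward sequence (the rewards and the choice of gamma are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# v_pi(s) is the expectation of this return over trajectories starting in s;
# averaging returns over many sampled trajectories (Monte Carlo) estimates it.
print(discounted_return([0.0, 0.0, 1.0]))  # -> 0.81
```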

A Model predicts what the environment will do next: $\mathcal{P}$ predicts the next state and $\mathcal{R}$ predicts the next (immediate) reward, e.g.

$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$

$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
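
A sketch of a tabular model estimated from experience: empirical transition frequencies stand in for $\mathcal{P}^a_{ss'}$ and average observed rewards for $\mathcal{R}^a_s$ (the recorded transitions are made up):

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a) -> total reward
visit_counts = defaultdict(int)                            # (s, a) -> visits

def record(s, a, r, s_next):
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def P(s, a, s_next):
    """Estimated P^a_{ss'}: empirical transition probability."""
    total = visit_counts[(s, a)]
    return transition_counts[(s, a)][s_next] / total if total else 0.0

def R(s, a):
    """Estimated R^a_s: empirical mean of the immediate reward."""
    total = visit_counts[(s, a)]
    return reward_sums[(s, a)] / total if total else 0.0

record("s1", "right", 1.0, "s2")
record("s1", "right", 0.0, "s1")
print(P("s1", "right", "s2"), R("s1", "right"))  # -> 0.5 0.5
```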

There is a maze example in David Silver's first slide deck which exhibits policies, value functions and models well.

Categorisations

For value-based agents, the policy can be derived implicitly from the learned values (e.g. by acting greedily with respect to them). Policy-based agents instead optimise the policy itself directly from reward feedback.
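
A sketch of how a value-based agent can derive its policy implicitly, acting greedily with respect to an assumed table of action values:

```python
# Hypothetical action values Q(s, a) a value-based agent might have learned.
q_values = {
    "s1": {"left": 0.2, "right": 0.8},
    "s2": {"left": 0.5, "right": 0.1},
}

def greedy_policy(state):
    """Implicit policy: pick the action with the highest estimated value."""
    actions = q_values[state]
    return max(actions, key=actions.get)

print(greedy_policy("s1"))  # -> "right"
```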

Gedanken Experiments

With a Model vs Without

Learning Without a Model (aka Experimentation):
  • The environment is initially unknown
  • The agent interacts with its environment
  • The agent improves its policy
  • The rules of the game are unknown; learning requires experimentation

Learning With a Model (aka Deliberate):
  • A model of the environment is known
  • The agent performs computations with its model
  • The agent improves its policy
  • The agent can query the model/emulator and deliberately plan ahead to find the optimal policy (see the sketch below)

When the environment is initially unknown, Exploration finds more information about the environment whereas Exploitation exploits known information to maximise reward. Both are important.
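
A common way to trade these off is epsilon-greedy action selection, sketched here with an assumed table of action values for a single state: explore with probability epsilon, otherwise exploit the current best estimate.

```python
import random

q_values = {"left": 0.2, "right": 0.8}  # assumed value estimates for one state

def epsilon_greedy(q, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q))        # exploration: try something new
    return max(q, key=q.get)                 # exploitation: use known information

counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[epsilon_greedy(q_values)] += 1
print(counts)  # mostly "right", with occasional exploratory "left"
```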

Prediction vs Control

  • Prediction: evaluate the future, given a policy
    • Question: What is the value function for the given policy?
  • Control: optimise the future, find the best policy
    • Question: What is (is there?) the optimal value function, $v_*(s) = \max_\pi v_\pi(s)$, over all possible policies? (see the sketch after this list)
    • Question: What is the optimal policy, $\pi_*$?
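
A sketch contrasting the two on a tiny, made-up two-state MDP: prediction evaluates a fixed policy by iterating the Bellman expectation backup, while control finds the optimal values with the Bellman optimality backup (value iteration) and reads off a greedy policy:

```python
# Tiny assumed MDP: deterministic transitions and rewards, discount gamma.
STATES, ACTIONS, GAMMA = ["s1", "s2"], ["left", "right"], 0.9
NEXT = {("s1", "left"): "s1", ("s1", "right"): "s2",
        ("s2", "left"): "s1", ("s2", "right"): "s2"}
REWARD = {("s1", "left"): 0.0, ("s1", "right"): 1.0,
          ("s2", "left"): 0.0, ("s2", "right"): 0.5}

def evaluate_policy(policy, sweeps=100):
    """Prediction: estimate v_pi by iterative policy evaluation for a fixed policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: REWARD[(s, policy[s])] + GAMMA * v[NEXT[(s, policy[s])]] for s in STATES}
    return v

def value_iteration(sweeps=100):
    """Control: estimate v_* by the Bellman optimality backup, then act greedily."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(REWARD[(s, a)] + GAMMA * v[NEXT[(s, a)]] for a in ACTIONS) for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: REWARD[(s, a)] + GAMMA * v[NEXT[(s, a)]]) for s in STATES}
    return v, policy

print(evaluate_policy({"s1": "left", "s2": "left"}))  # value of an arbitrary fixed policy
print(value_iteration())                              # optimal values and a greedy policy
```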