Fundamentals

Introduction to the fundamentals of reinforcement learning.




What is it

  • One of the main categories of Machine Learning techniques; others include Supervised and Unsupervised Learning
  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • Agent's actions affect the subsequent data it receives
  • All goals can be described by the maximisation of expected cumulative reward (the Reward Hypothesis); see the return sketch after this list
  • Sequential Decision Making
    • Goal: select actions to maximise total future reward
    • Actions may have long term consequences
    • Reward may be delayed
    • It may be better to sacrifice immediate reward to gain more long term reward
    • Examples:
      • Blocking opponent moves (might help winning chances many moves from now)
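A minimal sketch of the cumulative reward (return) the agent tries to maximise, assuming a discount factor gamma; the reward sequences below are made-up illustrative values, not from any particular problem.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Sacrificing immediate reward can pay off: a small negative reward now
# followed by a large delayed reward beats the greedy alternative here.
print(discounted_return([-1.0, 0.0, 0.0, +10.0]))  # delayed payoff, ~8.70
print(discounted_return([+1.0, 0.0, 0.0, 0.0]))    # immediate payoff only, 1.0
```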


Definitions

The History is the sequence of observations, actions, and rewards, H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t. The fundamental idea is that both the agent and the environment use this to determine what happens next.

The history is usually impractical to work with directly because it grows without bound. Instead we use the State, a summary of the history, to determine what happens next; formally, the state is a function of the history, S_t = f(H_t).

The environment state S_t^e (the environment's private representation, not usually visible to the agent) and the agent state S_t^a (the agent's internal representation) are defined separately.
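A minimal sketch of the agent-environment loop and the history-to-state summary described above. The env, policy and summarise objects and the reset/step methods are hypothetical stand-ins, not a specific library's API.

```python
# Sketch of the agent-environment loop: the history H_t accumulates
# observations, rewards and actions; the state S_t = f(H_t) summarises it.
def run_episode(env, policy, summarise, max_steps=100):
    history = []
    obs, reward = env.reset(), 0.0
    for _ in range(max_steps):
        history.extend([obs, reward])         # ..., O_t, R_t
        state = summarise(history)            # S_t = f(H_t)
        action = policy(state)                # choose A_t from the state
        history.append(action)                # ..., A_t
        obs, reward, done = env.step(action)  # environment emits O_{t+1}, R_{t+1}
        if done:
            break
    return history
```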

A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t], i.e. "the future is independent of the past given the present".

Full observability: the agent directly observes the environment state, O_t = S_t^a = S_t^e. Formally, this is a Markov Decision Process (MDP).

Partial observability: the agent only indirectly observes the environment, e.g. a robot with a camera is not told its position. The agent must construct its own state representation S_t^a, e.g. the full history S_t^a = H_t, a belief over environment states, or a recurrent neural network applied to the sequence of observations.
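One common, concrete way to build an agent state under partial observability is to stack the last k observations, a crude fixed-length summary of the history. The class name, k, and observation shape below are illustrative assumptions.

```python
from collections import deque
import numpy as np

class StackedObservationState:
    """S_t^a = the last k observations: a simple agent-state construction."""
    def __init__(self, k=4, obs_shape=(84, 84)):
        self.frames = deque([np.zeros(obs_shape)] * k, maxlen=k)

    def update(self, obs):
        self.frames.append(obs)               # drop the oldest observation
        return np.stack(self.frames, axis=0)  # agent state, shape (k, *obs_shape)
```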

The Agent

Components

A reinforcement learning agent may include one or more of the following components: 1) Policy: the agent's behaviour function; 2) Value Function: how good each state and/or action is; 3) Model: the agent's representation of the environment.

A Policy is a map from state to action, e.g. a deterministic policy a = pi(s), or a stochastic policy pi(a|s) = P[A_t = a | S_t = s].
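A sketch of the two policy types on a tiny made-up state/action space (the tables and sizes are illustrative placeholders, not from a real problem):

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2
action_table = np.array([0, 1, 1, 0])               # one fixed action per state
action_probs = np.full((N_STATES, N_ACTIONS), 0.5)  # pi(a|s), uniform here

def deterministic_policy(state):
    return action_table[state]                       # a = pi(s)

def stochastic_policy(state):
    return np.random.choice(N_ACTIONS, p=action_probs[state])  # sample a ~ pi(.|s)
```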

A Value Function is a prediction of future reward, used to evaluate the goodness/badness of states and therefore to select between actions, e.g. v_pi(s) = E_pi[R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... | S_t = s].
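One way to estimate such a value is by Monte Carlo: average the discounted return over rollouts that start in s and follow pi. This is a sketch assuming a hypothetical sample_episode_rewards helper that returns the reward sequence R_{t+1}, R_{t+2}, ... of one rollout.

```python
import numpy as np

def estimate_state_value(sample_episode_rewards, n_episodes=1000, gamma=0.99):
    """Monte Carlo estimate of v_pi(s) = E_pi[G_t | S_t = s]."""
    returns = []
    for _ in range(n_episodes):
        rewards = sample_episode_rewards()  # R_{t+1}, R_{t+2}, ... from one rollout
        returns.append(sum((gamma ** k) * r for k, r in enumerate(rewards)))
    return np.mean(returns)
```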

A Model predicts what the environment will do next: the transition model P predicts the next state, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a], and the reward model R predicts the next (immediate) reward, R^a_s = E[R_{t+1} | S_t = s, A_t = a].
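A sketch of a tabular model estimated from experience by counting transitions and averaging rewards; the state/action counts below are illustrative assumptions.

```python
import numpy as np

N_STATES, N_ACTIONS = 3, 2

def fit_model(transitions):
    """Estimate P[s, a, s'] ~ P[S_{t+1}=s' | S_t=s, A_t=a] and R[s, a] ~ E[R_{t+1} | s, a]
    from a list of experienced (s, a, r, s') tuples."""
    counts = np.zeros((N_STATES, N_ACTIONS, N_STATES))
    reward_sum = np.zeros((N_STATES, N_ACTIONS))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = np.maximum(counts.sum(axis=2), 1)  # avoid division by zero
    P = counts / visits[:, :, None]             # transition model
    R = reward_sum / visits                     # reward model
    return P, R
```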

There is a maze example in David Silver's first slide deck which exhibits policies, value functions and models well.

Categorisations

For a value-based agent, the policy can be derived implicitly from the value function (e.g. by acting greedily with respect to it). A policy-based agent instead optimises the policy itself directly from reward feedback.
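A sketch of this distinction on a made-up 2-state, 2-action problem; the Q table and preference parameters are illustrative placeholders.

```python
import numpy as np

# Value-based: the policy is implicit, recovered by acting greedily on Q[s, a].
Q = np.array([[0.1, 0.9],
              [0.5, 0.2]])

def greedy_policy(state):
    return int(np.argmax(Q[state]))            # pi(s) = argmax_a Q(s, a)

# Policy-based: the policy has its own parameters (here, action preferences)
# that would be adjusted directly from reward feedback, with no value table.
theta = np.zeros((2, 2))

def softmax_policy(state):
    prefs = np.exp(theta[state] - theta[state].max())
    return int(np.random.choice(2, p=prefs / prefs.sum()))
```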

Gedanken Experiments

With a Model vs Without

Learning Without a Model (aka Experimentation)
  • The environment is initially unknown; the rules of the game are not given, so learning requires experimentation
  • The agent interacts with its environment
  • The agent improves its policy

Learning With a Model (aka Deliberation)
  • A model of the environment is known; the agent can query it (e.g. an emulator) and deliberately plan ahead to find the optimal policy
  • The agent performs computations with its model
  • The agent improves its policy
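A sketch of how the updates differ, on a made-up 2-state, 2-action problem (P, R, the learning rate and the discount factor are illustrative):

```python
import numpy as np

gamma, alpha = 0.9, 0.1
Q = np.zeros((2, 2))

# Without a model: learn from a single experienced transition (s, a, r, s')
# via a Q-learning update.
def q_learning_update(s, a, r, s_next):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# With a model: no interaction needed; back up expected values directly from P and R.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]

def model_backup(s, a):
    Q[s, a] = R[s, a] + gamma * P[s, a] @ Q.max(axis=1)
```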


When the environment is initially unknown, Exploration gathers more information about the environment, whereas Exploitation uses already-known information to maximise reward. Both are important, and they must be traded off against each other.
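The simplest way to balance the two is epsilon-greedy action selection; the Q table and the value of epsilon below are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)  # explore: try a random action
    return int(np.argmax(Q[state]))          # exploit: take the best known action
```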

Prediction vs Control

  • Prediction: evaluate the future, given a policy
    • Question: What is the value function v_pi for the given policy pi?
  • Control: optimise the future, find the best policy (see the sketch after this list)
    • Question: What is (is there?) the optimal value function, v_*, over all possible policies?
    • Question: What is the optimal policy, pi_*?
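A sketch of both problems on a made-up 2-state, 2-action MDP (P, R and gamma are illustrative placeholders): prediction via iterative policy evaluation of v_pi for a fixed policy, control via value iteration towards v_* and pi_*.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])   # P[s, a, s']
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                 # R[s, a]

def predict(policy, n_sweeps=100):
    """Prediction: iterative policy evaluation of v_pi for a fixed pi(a|s)."""
    v = np.zeros(2)
    for _ in range(n_sweeps):
        q = R + gamma * P @ v              # q[s, a] = R[s, a] + gamma * sum_s' P[s,a,s'] v[s']
        v = (policy * q).sum(axis=1)       # v[s]    = sum_a pi(a|s) q[s, a]
    return v

def control(n_sweeps=100):
    """Control: value iteration towards v_*, with the greedy policy pi_*."""
    v = np.zeros(2)
    for _ in range(n_sweeps):
        q = R + gamma * P @ v
        v = q.max(axis=1)                  # Bellman optimality backup
    return v, q.argmax(axis=1)             # v_*, pi_*

uniform = np.full((2, 2), 0.5)             # pi(a|s) = 0.5 for both actions
print(predict(uniform))                    # value function of the uniform policy
print(control())                           # optimal value function and greedy policy
```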