Fundamentals
Resources
What is it
- One of the main categories of Machine Learning techniques, alongside Supervised and Unsupervised Learning
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential ...)
- Agent's actions affect the subsequent data it receives
- All goals can be described by the maximisation of expected cumulative reward (Reward Hypothesis)
- Sequential Decision Making
- Goal: select actions to maximise total future reward
- Actions may have long term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long term reward (see the sketch after this list)
- Examples:
- Blocking opponent moves (might help winning chances many moves from now)
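A minimal sketch of the "cumulative reward" idea, assuming a simple discounted sum; the reward sequences and the discount factor below are made up for illustration:

```python
# Sketch: cumulative discounted reward (the "return") for one episode.
# The reward sequences and discount factor gamma are illustrative values.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * R_{t+k+1} over the remaining rewards."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Sacrificing immediate reward can pay off: a small early reward vs a
# delayed but larger one.
greedy_now   = [1.0, 0.0, 0.0, 0.0]
patient_plan = [0.0, 0.0, 0.0, 5.0]
print(discounted_return(greedy_now))    # 1.0
print(discounted_return(patient_plan))  # 5 * 0.9^3 = 3.645
```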
Definitions
The History is the sequence of observations, actions and rewards, $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$. The fundamental idea is that both agent and environment use this to determine what happens next.
The history is often too large to be useful directly, so instead we use the State, a summary of the history, to determine what happens next: formally, $S_t = f(H_t)$.
The agent state $S_t^a$ and the environment state $S_t^e$ (the latter not usually visible to the agent) are defined separately.
A State is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$, i.e. "The future is independent of the past given the present".
Full observability: the agent directly observes the environment state, $O_t = S_t^a = S_t^e$. Formally, this is then a Markov Decision Process.
Partial observability: the agent only indirectly observes the environment, e.g. a robot with a camera is not told its position. The agent may construct its own state representation $S_t^a$, e.g. beliefs over the environment state or the hidden state of a recurrent neural network.
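A minimal sketch of the agent-environment loop, the history, and a state built from it. Everything here is an illustrative assumption, not from the lecture: the placeholder environment, the binary action set, and the "last k steps" state function.

```python
import random

# Sketch: the history H_t is the full sequence of observations, rewards and
# actions; the agent state is some function of it, S_t^a = f(H_t).
# The environment and state function below are illustrative placeholders.

def dummy_env(action):
    """Placeholder environment: returns (observation, reward) for an action."""
    observation = random.random()
    reward = 1.0 if action == 1 else 0.0
    return observation, reward

def agent_state(history, k=3):
    """Example state construction: summarise the history by its last k steps."""
    return tuple(history[-k:])

history = []  # grows as (observation, reward, action) triples
for t in range(10):
    state = agent_state(history)        # S_t^a = f(H_t)
    action = random.choice([0, 1])      # behaviour left unspecified here
    observation, reward = dummy_env(action)
    history.append((observation, reward, action))

print(len(history), agent_state(history))
```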
The Agent
Components
A reinforcement learning agent may include one or more of the following components:
- Policy: the agent's behaviour function
- Value Function: how good each state and/or action is
- Model: the agent's representation of the environment
A Policy is a map from state to action, e.g. a Deterministic Policy, $a = \pi(s)$, or a Stochastic Policy, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.
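A toy sketch of the two policy forms; the states, actions and probability tables are made up for illustration:

```python
import random

# Deterministic policy: a = pi(s). Here just a lookup table (illustrative).
pi_det = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return pi_det[state]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s].
pi_sto = {"s0": {"left": 0.8, "right": 0.2},
          "s1": {"left": 0.1, "right": 0.9}}

def act_stochastic(state):
    actions, probs = zip(*pi_sto[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s0"))  # always "left"
print(act_stochastic("s0"))     # "left" about 80% of the time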
A Value Function is a prediction of future reward, used to evaluate the goodness/badness of states and therefore to select between actions, e.g. $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$.
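One way to read the definition: $v_\pi(s)$ is an expectation of the discounted return, so it can be estimated by averaging sampled returns from $s$ under $\pi$ (a Monte Carlo-style estimate; the sampled reward sequences below are fabricated):

```python
# Sketch: v_pi(s) as the average discounted return over sampled episodes
# starting from s. The sampled reward sequences are made up for illustration.

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

sampled_reward_sequences = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
]
v_estimate = (sum(discounted_return(rs) for rs in sampled_reward_sequences)
              / len(sampled_reward_sequences))
print(v_estimate)  # average of 0.81, 0.9 and 1.0
```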
A Model predicts what the environment will do next: $\mathcal{P}$ predicts the next state and $\mathcal{R}$ predicts the next (immediate) reward, e.g. $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ and $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
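A minimal sketch of a tabular model estimated from experience counts; the transition data is invented, and the hat notation marks estimates rather than the true $\mathcal{P}$ and $\mathcal{R}$:

```python
from collections import defaultdict

# Sketch: a tabular model estimated from observed transitions.
# P_hat[(s, a)] approximates P[S_{t+1} = s' | S_t = s, A_t = a];
# R_hat[(s, a)] approximates E[R_{t+1} | S_t = s, A_t = a].
# The transitions below are invented for illustration.

transitions = [  # (s, a, r, s')
    ("s0", "right", 0.0, "s1"),
    ("s0", "right", 0.0, "s1"),
    ("s0", "right", 1.0, "s0"),
]

counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visits = defaultdict(int)

for s, a, r, s_next in transitions:
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visits[(s, a)] += 1

P_hat = {sa: {s2: n / visits[sa] for s2, n in nexts.items()}
         for sa, nexts in counts.items()}
R_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}

print(P_hat[("s0", "right")])  # {'s1': 0.67, 's0': 0.33} (approximately)
print(R_hat[("s0", "right")])  # 0.33 (approximately)
```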
There is a maze example in David Silver's first slide deck which exhibits policies, value functions and models well.
Categorisations
Agents can be categorised as value-based or policy-based. For a value-based agent, the policy is implicit, derived from the value function (e.g. by acting greedily with respect to it); a policy-based agent optimises the policy itself directly from reward feedback.
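A toy sketch of the value-based case: no explicit policy is stored, actions are read off greedily from an action-value table (the table and state/action names are made up):

```python
# Sketch: in a value-based agent the policy is implicit - just act greedily
# with respect to the learned values. The action-values below are made up.
q = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.8,
    ("s1", "left"): 0.5, ("s1", "right"): 0.2,
}
actions = ["left", "right"]

def implicit_policy(state):
    """Greedy policy derived from the value table; no separate policy stored."""
    return max(actions, key=lambda a: q[(state, a)])

print(implicit_policy("s0"))  # "right"
print(implicit_policy("s1"))  # "left"
```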
Gedanken Experiments
With a Model vs Without
| Without a Model (learning, requires experimentation) | With a Model (deliberate planning) |
|---|---|
| Rules of the game are unknown; the agent must experiment by interacting with the environment. | The agent can query the emulator/model of the environment and deliberately plan ahead. |
Exploration vs Exploitation
When the environment is initially unknown, Exploration finds more information about the environment, whereas Exploitation exploits known information to maximise reward. Both are important.
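One standard way to trade the two off is $\epsilon$-greedy action selection (a common illustration, not something specific to these notes; the action-value table is made up):

```python
import random

# Sketch: epsilon-greedy action selection, a simple exploration/exploitation
# trade-off. The action-value table is made up for illustration.
q = {"left": 0.2, "right": 0.7, "stay": 0.1}

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: random action
    return max(q_values, key=q_values.get)    # exploit: best-known action

print(epsilon_greedy(q))  # "right" most of the time, random ~10% of the time
```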
Prediction vs Control
- Prediction: evaluate the future, given a policy
- Question: What is the value function for the given policy?
- Control: optimise the future, find the best policy (contrasted with prediction in the sketch after this list)
- Question: What is (is there?) the optimal value function, $v_*$, over all possible policies?
- Question: What is the optimal policy, $\pi_*$?
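A minimal sketch contrasting the two on a tiny invented MDP with known dynamics: prediction as iterative policy evaluation of a fixed policy, control as value iteration towards $v_*$ and $\pi_*$. The two-state MDP below is made up for illustration.

```python
# Sketch: prediction vs control on a tiny made-up MDP with known dynamics.
# States: "A", "B". Actions: "stay", "go".
# P[(s, a)] = list of (prob, next_state, reward) triples (invented numbers).

P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"):   [(1.0, "B", 0.0)],
}
states, actions, gamma = ["A", "B"], ["stay", "go"], 0.9

def evaluate(policy, sweeps=50):
    """Prediction: value of a fixed policy (iterative policy evaluation)."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {s: sum(p * (r + gamma * v[s2]) for p, s2, r in P[(s, policy[s])])
             for s in states}
    return v

def value_iteration(sweeps=50):
    """Control: optimal values v_* and a greedy (optimal) policy pi_*."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[(s, a)])
                    for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a: sum(p * (r + gamma * v[s2])
                                            for p, s2, r in P[(s, a)]))
          for s in states}
    return v, pi

print(evaluate({"A": "stay", "B": "stay"}))  # value of a poor fixed policy: ~0
print(value_iteration())                     # v_*["A"] = 1.0, pi_*["A"] = "go"
```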