Value Function Approximation
Resources
- David Silver's Course Slides: '15
- 01 - Introduction (slides, video)
- 02 - Markov Decision Processes (slides, video)
- 03 - Dynamic Programming (slides, video)
- 04 - Model Free Prediction (slides, video)
- 05 - Model Free Control (slides, video)
- 06 - Value Function Approximation (slides, video)
- 07 - Policy Gradient Methods (slides, video)
- 08 - Integrating Learning & Planning (slides, video)
Goal
Too many states to consider or store, e.g. Go has on the order of $10^{170}$ states.
- Generalise from the seen to the unseen
- Update parameters w using Monte Carlo or TD Learning
Which function approximator? Focus on methods with a gradient.
- Linear combinations of features
- Neural Network
- Decision Tree
- Nearest Neighbour
- Fourier / Wavelet Bases
Gradient Descent
Cheating - There Exists an Oracle
That is, the oracle provides the true value $v_\pi(S)$, like a supervisor. Find a parameter vector $\mathbf{w}$ minimising the mean squared error between the approximate value function and the true value function:

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)^2\right]$$

Stochastic gradient descent samples the gradient, i.e. the per-sample update is

$$\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S, \mathbf{w})$$
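A minimal sketch of one such stochastic gradient step, assuming we can query the oracle for $v_\pi(s)$; the function names, the `alpha` step size, and the callable interfaces are illustrative assumptions, not code from the course:

```python
# Minimal sketch of one stochastic gradient step towards an oracle value.
# `v_hat`, `grad_v_hat`, `alpha`, and the oracle value `v_true` are
# illustrative assumptions: v_hat(s, w) is any differentiable approximator
# and grad_v_hat(s, w) returns its gradient with respect to w (e.g. as a
# NumPy array, so the arithmetic below is element-wise).

def sgd_step(w, s, v_true, v_hat, grad_v_hat, alpha=0.01):
    """Move w one step down the sampled squared-error gradient."""
    error = v_true - v_hat(s, w)      # the oracle supplies the true v_pi(s)
    return w + alpha * error * grad_v_hat(s, w)
```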
Linear Combination
Introduce a feature vector $\mathbf{x}(S)$ that 'represents' the state, e.g. distance from the wall. The approximate value function is then a linear combination of the features, $\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w}$, whose gradient is simply $\mathbf{x}(S)$. This is essentially putting 'our knowledge' into the problem to reduce the dimensionality of the state. Choosing good features is something of a black art.
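A minimal sketch of the linear case, assuming a hypothetical hand-crafted feature map (the state keys and step size below are made up for illustration):

```python
import numpy as np

# Sketch of the linear case, v_hat(s, w) = x(s) . w, with a made-up,
# hand-crafted feature map (the state keys below are hypothetical).

def features(state):
    """Hypothetical features, e.g. distances from the walls plus a bias."""
    return np.array([state["dist_left"], state["dist_right"], 1.0])

def v_hat(state, w):
    return features(state) @ w

def linear_sgd_step(w, state, v_true, alpha=0.01):
    """For a linear approximator the gradient is just the feature vector."""
    x = features(state)
    return w + alpha * (v_true - x @ w) * x

# Usage: three features, so three weights.
w = np.zeros(3)
w = linear_sgd_step(w, {"dist_left": 0.2, "dist_right": 0.8}, v_true=1.0)
```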
If we don't have an Oracle
Substitute the target from one of the previous model-free methods in place of the oracle's $v_\pi(S)$:
Method | Function Approximator | Update
---|---|---
MC (noisy, unbiased oracle) | General | $\Delta\mathbf{w} = \alpha\left(G_t - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$
 | Linear Combination | $\Delta\mathbf{w} = \alpha\left(G_t - \hat{v}(S_t, \mathbf{w})\right)\mathbf{x}(S_t)$
TD(0) (biased oracle) | General | $\Delta\mathbf{w} = \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$
 | Linear Combination | $\Delta\mathbf{w} = \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\mathbf{x}(S_t)$
TD($\lambda$) (biased oracle) | General | $\Delta\mathbf{w} = \alpha\left(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$
 | Linear Combination | $\Delta\mathbf{w} = \alpha\left(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\right)\mathbf{x}(S_t)$
- The same updates apply to action-value learning, with $\hat{q}(S, A, \mathbf{w})$ in place of $\hat{v}(S, \mathbf{w})$; a TD(0) prediction sketch follows below.
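For concreteness, a sketch of semi-gradient TD(0) prediction with a linear approximator; the `env`/`policy` interfaces and the hyperparameters are assumptions for illustration, not the course's code:

```python
import numpy as np

# Sketch of semi-gradient TD(0) prediction with a linear value function.
# The `env` interface (reset/step), `policy`, `features`, and all
# hyperparameters are assumptions for illustration.

def td0_linear(env, policy, features, n_features,
               episodes=1000, alpha=0.01, gamma=0.99):
    """Estimate v_pi with v_hat(s, w) = x(s) . w using the TD(0) target."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)       # assumed step signature
            x, x_next = features(s), features(s_next)
            # Bootstrap: the target uses the current weight estimate (biased).
            target = r if done else r + gamma * (x_next @ w)
            w += alpha * (target - x @ w) * x   # gradient of linear v_hat is x(s)
            s = s_next
    return w
```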
Convergence
Convergence is a problem!
Convergence of the prediction algorithms: the bias introduced by bootstrapping can theoretically cause TD with function approximation to diverge (even if it often works in practice). There are some recent approaches that fix the TD problems (Gradient TD / Emphatic TD).
Convergence of the control algorithms is also not guaranteed; see the summary table in Silver's slides.
Batch
- Experience Replay - store the experience generated while running the previous (limited-perspective) algorithms, sample from that dataset, and then optimise the parameters of the value function (for a linear combination, the weights) towards the targets.
- i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
- Deep Q-Networks (DQN) - uses experience replay and fixed Q-targets; algorithm skeleton in Silver's slides (a minimal sketch follows this list).
- This is a stable algorithm, for two reasons:
- Experience replay decorrelates the trajectories.
- Two networks: the Q-learning targets are computed from a frozen copy of the parameters from some time ago rather than the latest, freshest parameters; bootstrapping towards a target that moves with every update can cause the estimates to spiral off and diverge.
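A minimal sketch of the two stabilising ideas (replay buffer plus frozen target weights), using a linear $\hat{q}(S, A, \mathbf{w})$ rather than a deep network to keep it short; all names and hyperparameters are illustrative assumptions, not the DQN implementation from the slides:

```python
import random
from collections import deque
import numpy as np

# Sketch of replay buffer + frozen target weights with a linear
# q_hat(s, a, w) = x(s, a) . w. `features`, `actions`, and all
# hyperparameters are illustrative assumptions.

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def q_learning_with_replay(buffer, features, n_features, actions,
                           steps=10000, batch_size=32, alpha=0.01,
                           gamma=0.99, target_sync=500):
    """Q-learning updates from replayed minibatches towards frozen targets."""
    w = np.zeros(n_features)        # online weights, updated every step
    w_target = w.copy()             # frozen copy used to compute targets
    for t in range(steps):
        # Environment interaction omitted: assume `buffer` is being filled
        # elsewhere with (s, a, r, s_next, done) transitions.
        if len(buffer) < batch_size:
            continue
        for s, a, r, s_next, done in buffer.sample(batch_size):
            x = features(s, a)
            # Bootstrap towards the frozen weights, not the latest ones.
            q_next = 0.0 if done else max(features(s_next, a2) @ w_target
                                          for a2 in actions)
            w += alpha * (r + gamma * q_next - x @ w) * x
        if t % target_sync == 0:
            w_target = w.copy()     # periodically refresh the frozen target
    return w
```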
Linear Least Squares Methods
- Go straight to the solution: set the expected change in the weights to zero and solve for the weights directly (see the LSTD sketch below).
- Only works for linear function approximators.
- Can improve convergence for control / policy learning.
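A sketch of the least-squares TD (LSTD) idea: accumulate the statistics of the expected TD(0) update and solve the resulting linear system in one shot. The `transitions` format and the small ridge term are assumptions for illustration:

```python
import numpy as np

# Sketch of least-squares TD (LSTD) for a linear value function: accumulate
# the statistics of the expected TD update and solve for w directly.
# `transitions` is an assumed list of (x(s), r, x(s_next)) feature triples.

def lstd(transitions, n_features, gamma=0.99, reg=1e-3):
    """Solve A w = b, i.e. find w where the expected TD(0) update is zero."""
    A = reg * np.eye(n_features)          # ridge term keeps A invertible
    b = np.zeros(n_features)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```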