Value Function Approximation
Resources
- David Silver's Course Slides: '15
- 01 - Introduction (slides, video)
- 02 - Markov Decision Processes (slides, video)
- 03 - Dynamic Programming (slides, video)
- 04 - Model Free Prediction (slides, video)
- 05 - Model Free Control (slides, video)
- 06 - Value Function Approximation (slides, video)
- 07 - Policy Gradient Methods (slides, video)
- 08 - Integrating Learning & Planning (slides, video)
Goal
Too many states to consider or store, e.g. Go has on the order of $10^{170}$ states.
- Generalise from the seen to the unseen
- Update parameters w using Monte Carlo or TD Learning
Which function approximator? Focus on methods with a gradient.
- Linear combinations of features
- Neural Network
- Decision Tree
- Nearest Neighbour
- Fourier / Wavelet Bases
Gradient Descent
Cheating - There Exists an Oracle
That is, the oracle provides the true value $v_\pi(S)$, like a supervisor. Find a parameter vector $\mathbf{w}$ minimising the mean squared error between the approximate value function and the true value function:

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)^2\right]$$

Stochastic gradient descent samples the gradient, i.e. the per-sample update is

$$\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S, \mathbf{w})$$
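A minimal sketch of one such stochastic gradient step, assuming we can query the oracle for $v_\pi(s)$; the function names, the `alpha` step size, and the callable interfaces are illustrative assumptions, not code from the course:

```python
# Minimal sketch of one stochastic gradient step towards an oracle value.
# `v_hat`, `grad_v_hat`, `alpha`, and the oracle value `v_true` are
# illustrative assumptions: v_hat(s, w) is any differentiable approximator
# and grad_v_hat(s, w) returns its gradient with respect to w (e.g. as a
# NumPy array, so the arithmetic below is element-wise).

def sgd_step(w, s, v_true, v_hat, grad_v_hat, alpha=0.01):
    """Move w one step down the sampled squared-error gradient."""
    error = v_true - v_hat(s, w)      # the oracle supplies the true v_pi(s)
    return w + alpha * error * grad_v_hat(s, w)
```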
Linear Combination
Introduce a feature vector $\mathbf{x}(S)$ that 'represents' the state, e.g. distance from the wall. The approximate value function is then a linear combination of the features, $\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w}$, whose gradient is simply $\mathbf{x}(S)$. This is essentially putting 'our knowledge' into the problem to reduce the dimensionality of the state. Choosing good features is something of a black art.
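A minimal sketch of the linear case, assuming a hypothetical hand-crafted feature map (the state keys and step size below are made up for illustration):

```python
import numpy as np

# Sketch of the linear case, v_hat(s, w) = x(s) . w, with a made-up,
# hand-crafted feature map (the state keys below are hypothetical).

def features(state):
    """Hypothetical features, e.g. distances from the walls plus a bias."""
    return np.array([state["dist_left"], state["dist_right"], 1.0])

def v_hat(state, w):
    return features(state) @ w

def linear_sgd_step(w, state, v_true, alpha=0.01):
    """For a linear approximator the gradient is just the feature vector."""
    x = features(state)
    return w + alpha * (v_true - x @ w) * x

# Usage: three features, so three weights.
w = np.zeros(3)
w = linear_sgd_step(w, {"dist_left": 0.2, "dist_right": 0.8}, v_true=1.0)
```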
If we don't have an Oracle
Substitute the target from one of the previous model-free methods in place of the oracle's $v_\pi(S)$:
Method | Function Approximator | Update
---|---|---
MC (noisy, unbiased oracle) | General | $\Delta\mathbf{w} = \alpha\left(G_t - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$
 | Linear Combination | $\Delta\mathbf{w} = \alpha\left(G_t - \hat{v}(S_t, \mathbf{w})\right)\mathbf{x}(S_t)$
TD(0) (biased oracle) | General | $\Delta\mathbf{w} = \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$
 | Linear Combination | $\Delta\mathbf{w} = \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\mathbf{x}(S_t)$
TD($\lambda$) (biased oracle) | General | $\Delta\mathbf{w} = \alpha\left(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$
 | Linear Combination | $\Delta\mathbf{w} = \alpha\left(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\right)\mathbf{x}(S_t)$
- The same updates apply to action-value learning, with $\hat{q}(S, A, \mathbf{w})$ in place of $\hat{v}(S, \mathbf{w})$; a TD(0) prediction sketch follows below.
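For concreteness, a sketch of semi-gradient TD(0) prediction with a linear approximator; the `env`/`policy` interfaces and the hyperparameters are assumptions for illustration, not the course's code:

```python
import numpy as np

# Sketch of semi-gradient TD(0) prediction with a linear value function.
# The `env` interface (reset/step), `policy`, `features`, and all
# hyperparameters are assumptions for illustration.

def td0_linear(env, policy, features, n_features,
               episodes=1000, alpha=0.01, gamma=0.99):
    """Estimate v_pi with v_hat(s, w) = x(s) . w using the TD(0) target."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)       # assumed step signature
            x, x_next = features(s), features(s_next)
            # Bootstrap: the target uses the current weight estimate (biased).
            target = r if done else r + gamma * (x_next @ w)
            w += alpha * (target - x @ w) * x   # gradient of linear v_hat is x(s)
            s = s_next
    return w
```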
Convergence
Convergence is a problem!
Convergence of the prediction algorithms: the bias introduced by bootstrapping can theoretically cause TD with function approximation to diverge (even if it often works in practice). There are some recent approaches that fix the TD problems (Gradient TD / Emphatic TD).
Convergence of the control algorithms is also not guaranteed; see the summary table in Silver's slides.
Batch
- Experience Replay - store the experience generated while running the previous (limited-perspective) algorithms, sample from that dataset, and then optimise the parameters of the value function (for a linear combination, the weights) towards the targets.
- i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
- Deep Q-Networks (DQN) - uses experience replay and fixed Q-targets; algorithm skeleton in Silver's slides (a minimal sketch follows this list).
- This is a stable algorithm, for two reasons:
- Experience replay decorrelates the trajectories.
- Two networks: the Q-learning targets are computed from a frozen copy of the parameters from some time ago rather than the latest, freshest parameters; bootstrapping towards a target that moves with every update can cause the estimates to spiral off and diverge.
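A minimal sketch of the two stabilising ideas (replay buffer plus frozen target weights), using a linear $\hat{q}(S, A, \mathbf{w})$ rather than a deep network to keep it short; all names and hyperparameters are illustrative assumptions, not the DQN implementation from the slides:

```python
import random
from collections import deque
import numpy as np

# Sketch of replay buffer + frozen target weights with a linear
# q_hat(s, a, w) = x(s, a) . w. `features`, `actions`, and all
# hyperparameters are illustrative assumptions.

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def q_learning_with_replay(buffer, features, n_features, actions,
                           steps=10000, batch_size=32, alpha=0.01,
                           gamma=0.99, target_sync=500):
    """Q-learning updates from replayed minibatches towards frozen targets."""
    w = np.zeros(n_features)        # online weights, updated every step
    w_target = w.copy()             # frozen copy used to compute targets
    for t in range(steps):
        # Environment interaction omitted: assume `buffer` is being filled
        # elsewhere with (s, a, r, s_next, done) transitions.
        if len(buffer) < batch_size:
            continue
        for s, a, r, s_next, done in buffer.sample(batch_size):
            x = features(s, a)
            # Bootstrap towards the frozen weights, not the latest ones.
            q_next = 0.0 if done else max(features(s_next, a2) @ w_target
                                          for a2 in actions)
            w += alpha * (r + gamma * q_next - x @ w) * x
        if t % target_sync == 0:
            w_target = w.copy()     # periodically refresh the frozen target
    return w
```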
Linear Least Squares Methods
- Go straight to the solution: set the expected change in the weights to zero and solve for the weights directly (see the LSTD sketch below).
- Only works for linear function approximators.
- Can improve convergence for control / policy learning.
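A sketch of the least-squares TD (LSTD) idea: accumulate the statistics of the expected TD(0) update and solve the resulting linear system in one shot. The `transitions` format and the small ridge term are assumptions for illustration:

```python
import numpy as np

# Sketch of least-squares TD (LSTD) for a linear value function: accumulate
# the statistics of the expected TD update and solve for w directly.
# `transitions` is an assumed list of (x(s), r, x(s_next)) feature triples.

def lstd(transitions, n_features, gamma=0.99, reg=1e-3):
    """Solve A w = b, i.e. find w where the expected TD(0) update is zero."""
    A = reg * np.eye(n_features)          # ridge term keeps A invertible
    b = np.zeros(n_features)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```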