Value Function Approximation

Approximating the state or action value function.

Goal

Too many states $|\mathcal{S}|$ or actions $|\mathcal{A}|$ to consider for a lookup table, e.g. Go is in the order of $10^{170}$ states.

  • Generalise from the seen to the unseen
  • Update parameters w using Monte Carlo or TD Learning

Which function approximator? Focus on methods with a gradient.

  • Linear combinations of features
  • Neural Network
  • Decision Tree
  • Nearest Neighbour
  • Fourier / Wavelet Bases

Gradient Descent

Cheating - There Exists an Oracle

That is, the oracle provides the true value $v_\pi(S)$, like a supervisor. Find a parameter vector $\mathbf{w}$ minimising the mean squared error between the approximate value function $\hat{v}(S, \mathbf{w})$ and the true value function:

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)^2\right]$$

Gradient descent follows the negative gradient, $\Delta\mathbf{w} = -\tfrac{1}{2}\alpha\,\nabla_\mathbf{w} J(\mathbf{w})$. Stochastic gradient descent samples the gradient, i.e.

$$\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)\nabla_\mathbf{w}\hat{v}(S, \mathbf{w})$$

Linear Combination

Introduce a feature vector $\mathbf{x}(S)$ that 'represents' the entirety of the state, e.g. distance from the wall. This is essentially putting 'our knowledge' into the problem to reduce the dimensionality of the state (something of a black art). The value function is then a linear combination of features, $\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w}$, whose gradient is simply the feature vector, so the sampled update is $\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S, \mathbf{w})\right)\mathbf{x}(S)$.
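
A minimal sketch of this linear case, assuming a hypothetical oracle true_value(s) and a hand-crafted feature map (the names and the dist_to_wall attribute are illustrative assumptions, not from the lecture):

```python
import numpy as np

def features(s):
    # Hypothetical hand-crafted feature map x(S), e.g. based on distance from the wall.
    return np.array([1.0, s.dist_to_wall, s.dist_to_wall ** 2])

def v_hat(s, w):
    # Linear combination of features: v̂(S, w) = x(S)·w
    return features(s) @ w

def sgd_step(s, w, alpha, true_value):
    # The gradient of v̂ w.r.t. w is just x(S), so the sampled update is
    # Δw = α (v_π(S) − v̂(S, w)) x(S)
    x = features(s)
    return w + alpha * (true_value(s) - x @ w) * x
```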

If we don't have an Oracle

Substitute in the target from one of the previous model-free methods; the TD(0) case is also sketched in code below:

Method / function approximator / update:

  • MC (noisy oracle)
    • General: $\Delta\mathbf{w} = \alpha\,(G_t - \hat{v}(S_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$
    • Linear Combination: $\Delta\mathbf{w} = \alpha\,(G_t - \hat{v}(S_t, \mathbf{w}))\,\mathbf{x}(S_t)$
  • TD(0) (biased oracle)
    • General: $\Delta\mathbf{w} = \alpha\,(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$
    • Linear Combination: $\Delta\mathbf{w} = \alpha\,\delta_t\,\mathbf{x}(S_t)$, where $\delta_t$ is the TD error
  • TD($\lambda$)
    • General: $\Delta\mathbf{w} = \alpha\,(G_t^\lambda - \hat{v}(S_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$ (forward view)
    • Linear Combination: $\Delta\mathbf{w} = \alpha\,\delta_t\,E_t$, with eligibility trace $E_t = \gamma\lambda E_{t-1} + \mathbf{x}(S_t)$ (backward view)
  • Applies as well to action value learning.
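
A sketch of the TD(0) row with a linear combination of features, as referenced above; the environment/policy interface (env.reset(), env.step() returning (next_state, reward, done), policy(s)) and the features(s) helper are assumptions for illustration:

```python
import numpy as np

def td0_prediction(env, policy, features, n_features,
                   episodes=1000, alpha=0.01, gamma=0.99):
    # Semi-gradient TD(0) with a linear value function v̂(S, w) = x(S)·w.
    # env, policy and features are hypothetical helpers:
    #   env.reset() -> state, env.step(action) -> (next_state, reward, done).
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            # The bootstrapped target R + γ v̂(S', w) replaces the oracle's value.
            target = r + (0.0 if done else gamma * features(s_next) @ w)
            w += alpha * (target - x @ w) * x   # Δw = α δ x(S)
            s = s_next
    return w
```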

Convergence

Convergence is a problem!

Convergence of the prediction algorithms: the bias introduced by bootstrapping can theoretically cause TD to diverge (even if in practice it usually works). There are some recent approaches that fix the TD problems (Gradient TD / Emphatic TD).

Convergence of the control algorithms

Batch

  • Experience Replay - store the experience generated while running the previous (limited-perspective) algorithms as a dataset, sample from it, and optimise the parameters (e.g. the weights of a linear combination) towards the value-function targets.
    • i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
  • Deep Q-Networks - uses experience replay and fixed Q-targets, algorithm skeleton in Silver's slides.
    • This is a stable algorithm!
      • Experience replay decorrelates the trajectories
      • Two networks: bootstrap towards a frozen target network whose parameters were fixed some time ago, rather than towards the latest, freshest parameters; chasing a constantly moving target is what causes the updates to spiral out of control. Both ideas are sketched below.
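
A condensed sketch of the two stabilising ideas (replay buffer plus a frozen target network) using PyTorch; this is not Silver's exact skeleton, and q_net, target_net and optimiser are assumed to already exist:

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    # Stores transitions so minibatches can be sampled i.i.d., decorrelating the trajectories.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        s, a, r, s_next, done = zip(*random.sample(self.buffer, batch_size))
        to = torch.tensor
        return to(s).float(), to(a).long(), to(r).float(), to(s_next).float(), to(done).float()

def dqn_update(q_net, target_net, optimiser, buffer, batch_size=32, gamma=0.99):
    # One gradient step towards the *frozen* target network's bootstrapped values.
    s, a, r, s_next, done = buffer.sample(batch_size)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap towards old, fixed parameters rather than the freshest ones.
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# Every C steps, copy the online parameters into the frozen target network:
# target_net.load_state_dict(q_net.state_dict())
```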

Linear Least Squares Methods

  • Go straight to the solution by setting the expected weight change to zero and solving for the weights directly.
  • Only works for linear problems
  • Can improve the convergence situation for policy learning; the prediction case (LSTD) is sketched below.
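
A sketch of the least-squares idea for TD(0) prediction with linear features (LSTD): setting the expected weight change to zero gives a linear system that can be solved for the weights in one shot. The batch data layout (feature matrices for $S_t$ and $S_{t+1}$, a reward vector) is an assumption:

```python
import numpy as np

def lstd(X, X_next, R, gamma=0.99, reg=1e-3):
    # Least-squares TD(0) for a linear value function v̂(S, w) = x(S)·w.
    # X, X_next: (T, d) feature matrices for S_t and S_{t+1}; R: (T,) rewards R_{t+1}.
    # Setting E[Δw] = E[α (R + γ x'·w − x·w) x] = 0 gives A w = b with
    #   A = Σ_t x_t (x_t − γ x_{t+1})^T,   b = Σ_t R_{t+1} x_t.
    d = X.shape[1]
    A = X.T @ (X - gamma * X_next) + reg * np.eye(d)  # small ridge term keeps A invertible
    b = X.T @ R
    return np.linalg.solve(A, b)
```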