Dynamic Programming

Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.

About

Dynamic programming can be applied to problems which have optimal substructure (i.e. satisfy the principle of optimality) and which have overlapping subproblems whose solutions can be reused (e.g. via caching). The Bellman equation for MDPs has exactly this structure: the value of a state decomposes recursively into the immediate reward plus the discounted value of the remainder, so dynamic programming applies.

  • It is used for planning in an MDP (the model is known, there is no interaction with the environment, the policy is improved offline) and therefore needs the full MDP.
  • For Evaluation/Prediction - given an MDP and a policy $\pi$, what is the value function $v_\pi$?
  • For Learning/Control - given an MDP, what is the optimal value function $v_*$ and the optimal policy $\pi_*$?

Control uses evaluation in an inner loop to learn the best policy to follow.

Policy Evaluation

Start out with an initial guess for the policy's value function, e.g. zero everywhere, then iterate the Bellman expectation equation on it. The process uses synchronous backups (asynchronous backups are also possible), updating every state on each sweep:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_k(s') \Big)$$
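
As a concrete illustration, here is a minimal sketch of synchronous iterative policy evaluation in Python, assuming a tabular MDP stored as hypothetical arrays `P[s, a, s2]` (transition probabilities), `R[s, a]` (expected rewards) and a stochastic policy `pi[s, a]`; these names are illustrative, not from the notes above.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Synchronous iterative policy evaluation (sketch).

    P[s, a, s2] : probability of moving to state s2 from state s under action a
    R[s, a]     : expected immediate reward for taking action a in state s
    pi[s, a]    : probability the policy picks action a in state s
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)            # initial guess: zero everywhere
    while True:
        # Bellman expectation backup applied to every state at once (synchronous)
        q = R + gamma * P @ v         # q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * v[s2]
        v_new = (pi * q).sum(axis=1)  # average over actions, weighted by the policy
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```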

Policy Evaluation & Improvement

  • Evaluate a policy by finding its value function
  • Improve the policy
    • e.g. greedy improvement: in each state, choose the action that leads to the highest-valued outcome, i.e. act greedily with respect to the action value function when picking the new policy's action
  • This process (policy iteration) always converges to the optimal value function and optimal policy; a sketch of the full loop follows the Bonus notes below.

Bonus:

  • You can intersperse evaluation and improvement more tightly by stopping evaluation after k iterations instead of running it to convergence (modified policy iteration); any k still converges to the optimum, and with k=1 the algorithm is exactly value iteration rather than policy iteration proper.
  • The benefit is that the greedy policy often converges to $\pi_*$ before the value function has converged to $v_\pi$, so running evaluation all the way to convergence is frequently wasted effort.
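
Putting evaluation and greedy improvement together, a sketch of the full policy iteration loop might look like the following, reusing the hypothetical `policy_evaluation` helper and tabular layout from the sketch above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: alternate full evaluation with greedy improvement (sketch).

    Reuses the assumed tabular layout: P[s, a, s2] transition probabilities,
    R[s, a] expected rewards.
    """
    n_states, n_actions = R.shape
    # start from the uniform random policy
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        v = policy_evaluation(P, R, pi, gamma)        # evaluate the current policy
        q = R + gamma * P @ v                         # action values under v
        greedy = np.eye(n_actions)[q.argmax(axis=1)]  # act greedily w.r.t. q
        if np.allclose(greedy, pi):                   # policy stable => optimal
            return greedy, v
        pi = greedy
```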

Value Iteration

If you don't have a policy, but you assume the optimal values of the successor states are already known, the Bellman optimality equation (which contains no policy) gives the optimal value of the current state by one-step lookahead. Value iteration applies this as a synchronous backup over all states:

$$v_{k+1}(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_k(s') \Big)$$

  • Value functions generated by the intermediate iterations do not necessarily correspond to any real policy (they are just numerical iterates)
  • But they do converge to the optimal value function $v_*$, from which the optimal policy can be read off greedily
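
A corresponding sketch of value iteration, under the same assumed tabular layout as above, simply swaps the policy-weighted average for a max over actions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Synchronous value iteration using the Bellman optimality backup (sketch).

    Same assumed tabular MDP layout as above: P[s, a, s2], R[s, a].
    """
    v = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: max over actions instead of averaging over a policy
        v_new = (R + gamma * P @ v).max(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # the optimal policy is read off greedily at the end
    pi = (R + gamma * P @ v).argmax(axis=1)
    return v, pi
```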

Action Iteration

It is possible to run analogous iterations on the action value function $q(s, a)$ (q-value iteration). They are the route to getting away from having to know the transition dynamics: the greedy policy can be read directly from $q(s, a)$, without the model, even though the dynamic-programming backup itself still uses it. This is the stepping stone towards model-free methods.
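
A sketch of the action-value variant, under the same assumptions as the earlier sketches: the backup still uses the model `P`, but the greedy policy at the end is read from `q` alone.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration over action values q(s, a) (sketch).

    Same assumed tabular layout: P[s, a, s2], R[s, a]. This dynamic-programming
    version still uses the model P; the pay-off is that the greedy policy below
    is read from q alone.
    """
    q = np.zeros_like(R)
    while True:
        # q_{k+1}(s, a) = R(s, a) + gamma * sum_s2 P(s2 | s, a) * max_a2 q_k(s2, a2)
        q_new = R + gamma * P @ q.max(axis=1)
        if np.max(np.abs(q_new - q)) < theta:
            return q_new
        q = q_new

def greedy_policy(q):
    """Policy improvement from q: argmax over actions, no transition model needed."""
    return q.argmax(axis=1)
```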

Summary

Problem    | Bellman Equation                                          | Algorithm
Prediction | Bellman Expectation Equation                              | Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                               | Value Iteration

Proofs

  • How do we know that value iteration converges to $v_*$?
  • Or that iterative policy evaluation converges to $v_\pi$?
  • And therefore that policy iteration converges to $v_*$ and $\pi_*$?
  • Is the solution unique?
  • How fast do these algorithms converge?

These questions are resolved by the contraction mapping theorem.
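
In outline, the argument is that the Bellman backup operators are $\gamma$-contractions in the max norm, so the contraction mapping (Banach fixed point) theorem gives a unique fixed point and geometric convergence to it. For the Bellman expectation operator, for example:

```latex
% Bellman expectation operator and its contraction property (sketch)
\begin{align*}
  (T^{\pi} v)(s) &= \sum_{a} \pi(a \mid s)\Big( \mathcal{R}_s^a
                     + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a}\, v(s') \Big) \\
  \lVert T^{\pi} u - T^{\pi} v \rVert_{\infty} &\le \gamma\, \lVert u - v \rVert_{\infty}
\end{align*}
% Hence T^pi has a unique fixed point v_pi and iterative policy evaluation
% converges to it geometrically; the same argument applied to the Bellman
% optimality operator covers value iteration and v_*.
```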