Dynamic Programming
About
Dynamic programming applies to problems with optimal substructure, i.e. problems that satisfy the principle of optimality, and ideally with overlapping subproblems whose solutions can be reused (e.g. via caching). The Bellman equation for MDPs has exactly this form: the value of a state decomposes recursively into the immediate reward (the "now") plus the discounted value of what remains.
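In standard notation (discount factor $\gamma$), the recursive decomposition is the Bellman expectation equation:

$$ v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right] $$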
- Dynamic programming is used for planning in an MDP: the model is known, there is no interaction with the environment, and the policy is improved offline; it requires the full MDP.
- For Evaluation/Prediction: given an MDP and a policy $\pi$, what is the value function $v_\pi$?
- For Control: given an MDP, what are the optimal value function $v_*$ and an optimal policy $\pi_*$?
Control uses evaluation in an inner loop to find the best policy to follow.
Policy Evaluation
Start with an initial guess for the policy's value function, e.g. zero everywhere, then iterate using the Bellman expectation equation. The process works via synchronous backups (asynchronous backups are also possible):
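Each sweep applies the Bellman expectation backup to every state; in standard notation ($\mathcal{P}$ transition probabilities, $\mathcal{R}$ expected rewards):

$$ v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_k(s') \right) $$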
Policy Evaluation & Improvement
- Evaluate a policy by finding its value function $v_\pi$
- Improve the policy
- e.g. greedy improvement picks, in each state, the action that maximises the action-value function $q_\pi(s, a)$, i.e. it prefers actions that move to states with a better value
- Alternating these two steps (policy iteration) always converges to the optimal value function and an optimal policy
Bonus:
- You can truncate policy evaluation to k sweeps before each improvement step (modified policy iteration; the parameter `k` in the sketch below) and still converge to the optimum; in the special case k = 1 this is equivalent to value iteration.
- The benefit is that the greedy policy often converges to $\pi_*$ before the value estimates have converged to $v_\pi$, so you can stop the inner loop early.
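A minimal sketch of the evaluate/improve loop on a tabular MDP, assuming the model is given as hypothetical arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards); the function name and the truncation parameter `k` are illustrative, not from the source:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, k=50):
    """Tabular policy iteration (illustrative sketch).

    P -- (S, A, S) array of transition probabilities P[s, a, s'].
    R -- (S, A) array of expected immediate rewards.
    k -- evaluation sweeps per improvement step (modified policy
         iteration; with k = 1 this behaves like value iteration).
    """
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    v = np.zeros(n_states)                        # initial guess: zero everywhere

    while True:
        # Policy evaluation: k synchronous Bellman expectation backups.
        for _ in range(k):
            P_pi = P[np.arange(n_states), policy]   # (S, S) dynamics under the policy
            R_pi = R[np.arange(n_states), policy]   # (S,) rewards under the policy
            v = R_pi + gamma * P_pi @ v

        # Policy improvement: act greedily w.r.t. the one-step lookahead q-values.
        q = R + gamma * np.einsum("sat,t->sa", P, v)   # (S, A)
        new_policy = q.argmax(axis=1)

        if np.array_equal(new_policy, policy):         # greedy policy is stable
            return policy, v
        policy = new_policy
```

With a large `k` the inner loop runs evaluation essentially to convergence (classic policy iteration); shrinking `k` gives the interleaved scheme described in the Bonus notes above.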
Value Iteration
If you don't have a policy, you can work directly with the Bellman optimality equation: intuitively, if the optimal values of the successor states were known, a one-step lookahead would give the optimal value of the current state. Value iteration applies this idea repeatedly, again via synchronous backups:
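Each sweep applies the Bellman optimality backup to every state:

$$ v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_k(s') \right) $$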
- The intermediate value functions generated this way do not necessarily correspond to any real policy (they are just numerical iterates)
- But they do converge to the optimal value function $v_*$, and a greedy policy with respect to $v_*$ is an optimal policy $\pi_*$
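A corresponding sketch of value iteration, under the same assumed `P[s, a, s']` / `R[s, a]` array layout as before (names illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration (illustrative sketch).

    P -- (S, A, S) array of transition probabilities P[s, a, s'].
    R -- (S, A) array of expected immediate rewards.
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)                              # initial guess
    while True:
        # Bellman optimality backup for every state at once.
        q = R + gamma * np.einsum("sat,t->sa", P, v)    # (S, A) one-step lookahead
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:             # values have converged
            break
        v = v_new
    policy = q.argmax(axis=1)                           # read off a greedy policy
    return policy, v_new
```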
Action Iteration
It is possible to run iterations similar to those above on the action-value function $q(s, a)$ too. The backups themselves still require the model, but the payoff is that acting greedily with respect to $q$ needs no knowledge of the transition dynamics.
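The analogous Bellman optimality backup over action values has the standard form

$$ q_{k+1}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_k(s', a') $$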
Summary
| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
Proofs
- How do we know that value iteration converges to $v_*$?
- Or that iterative policy evaluation converges to $v_\pi$?
- And therefore that policy iteration converges to $v_*$?
- Is the solution unique?
- How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem.
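In outline (a standard result, not derived in these notes): both the Bellman expectation backup operator $T^\pi$ and the Bellman optimality backup operator $T^*$ are $\gamma$-contractions in the $\infty$-norm,

$$ \| T^\pi v - T^\pi u \|_\infty \le \gamma \| v - u \|_\infty, \qquad \| T^* v - T^* u \|_\infty \le \gamma \| v - u \|_\infty, $$

so by the contraction mapping (Banach fixed-point) theorem each has a unique fixed point, $v_\pi$ and $v_*$ respectively, and the corresponding iterations converge to it geometrically at rate $\gamma$.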