Dynamic Programming

Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.

About

Dynamic programming can be applied to problems which have optimal substructure (i.e. satisfy the principle of optimality) and which have overlapping subproblems whose solutions can be reused (e.g. via caching). The Bellman equation for MDPs has exactly this structure: the value of a state decomposes recursively into the immediate reward plus the discounted value of the remainder, so dynamic programming applies.

  • It is used for planning in an MDP (the model is known, there is no interaction with the environment, the policy is improved offline) and therefore needs the full MDP.
  • For Evaluation/Prediction - given an MDP and a policy $\pi$, what is the value function $v_\pi$?
  • For Learning/Control - given an MDP, what is the optimal value function $v_*$ and the optimal policy $\pi_*$?

Control uses evaluation in an inner loop to learn the best policy to follow.

Policy Evaluation

Start out with an initial guess for the policy's value function, e.g. zero everywhere, then iterate the Bellman expectation equation on it. The process uses synchronous backups (asynchronous backups are also possible), updating every state on each sweep:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_k(s') \Big)$$
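
As a concrete illustration, here is a minimal sketch of synchronous iterative policy evaluation in Python, assuming a tabular MDP stored as hypothetical arrays `P[s, a, s2]` (transition probabilities), `R[s, a]` (expected rewards) and a stochastic policy `pi[s, a]`; these names are illustrative, not from the notes above.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Synchronous iterative policy evaluation (sketch).

    P[s, a, s2] : probability of moving to state s2 from state s under action a
    R[s, a]     : expected immediate reward for taking action a in state s
    pi[s, a]    : probability the policy picks action a in state s
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)            # initial guess: zero everywhere
    while True:
        # Bellman expectation backup applied to every state at once (synchronous)
        q = R + gamma * P @ v         # q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * v[s2]
        v_new = (pi * q).sum(axis=1)  # average over actions, weighted by the policy
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```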

Policy Evaluation & Improvement

  • Evaluate a policy by finding its value function
  • Improve the policy
    • e.g. greedy improvement: in each state, choose the action that leads to the highest-valued outcome, i.e. act greedily with respect to the action value function when picking the new policy's action
  • This process (policy iteration) always converges to the optimal value function and optimal policy; a sketch of the full loop follows the Bonus notes below.

Bonus:

  • You can intersperse evaluation and improvement more tightly by stopping evaluation after k iterations instead of running it to convergence (modified policy iteration); any k still converges to the optimum, and with k=1 the algorithm is exactly value iteration rather than policy iteration proper.
  • The benefit is that the greedy policy often converges to $\pi_*$ before the value function has converged to $v_\pi$, so running evaluation all the way to convergence is frequently wasted effort.
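
Putting evaluation and greedy improvement together, a sketch of the full policy iteration loop might look like the following, reusing the hypothetical `policy_evaluation` helper and tabular layout from the sketch above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: alternate full evaluation with greedy improvement (sketch).

    Reuses the assumed tabular layout: P[s, a, s2] transition probabilities,
    R[s, a] expected rewards.
    """
    n_states, n_actions = R.shape
    # start from the uniform random policy
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        v = policy_evaluation(P, R, pi, gamma)        # evaluate the current policy
        q = R + gamma * P @ v                         # action values under v
        greedy = np.eye(n_actions)[q.argmax(axis=1)]  # act greedily w.r.t. q
        if np.allclose(greedy, pi):                   # policy stable => optimal
            return greedy, v
        pi = greedy
```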

Value Iteration

If you don't have a policy, but you assume the optimal values of the successor states are already known, the Bellman optimality equation (which contains no policy) gives the optimal value of the current state by one-step lookahead. Value iteration applies this as a synchronous backup over all states:

$$v_{k+1}(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_k(s') \Big)$$

  • Value functions generated by the intermediate iterations do not necessarily correspond to any real policy (they are just numerical iterates)
  • But they do converge to the optimal value function $v_*$, from which the optimal policy can be read off greedily
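
A corresponding sketch of value iteration, under the same assumed tabular layout as above, simply swaps the policy-weighted average for a max over actions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Synchronous value iteration using the Bellman optimality backup (sketch).

    Same assumed tabular MDP layout as above: P[s, a, s2], R[s, a].
    """
    v = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: max over actions instead of averaging over a policy
        v_new = (R + gamma * P @ v).max(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # the optimal policy is read off greedily at the end
    pi = (R + gamma * P @ v).argmax(axis=1)
    return v, pi
```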

Action Iteration

It is possible to run analogous iterations on the action value function $q(s, a)$ (q-value iteration). They are the route to getting away from having to know the transition dynamics: the greedy policy can be read directly from $q(s, a)$, without the model, even though the dynamic-programming backup itself still uses it. This is the stepping stone towards model-free methods.
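
A sketch of the action-value variant, under the same assumptions as the earlier sketches: the backup still uses the model `P`, but the greedy policy at the end is read from `q` alone.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration over action values q(s, a) (sketch).

    Same assumed tabular layout: P[s, a, s2], R[s, a]. This dynamic-programming
    version still uses the model P; the pay-off is that the greedy policy below
    is read from q alone.
    """
    q = np.zeros_like(R)
    while True:
        # q_{k+1}(s, a) = R(s, a) + gamma * sum_s2 P(s2 | s, a) * max_a2 q_k(s2, a2)
        q_new = R + gamma * P @ q.max(axis=1)
        if np.max(np.abs(q_new - q)) < theta:
            return q_new
        q = q_new

def greedy_policy(q):
    """Policy improvement from q: argmax over actions, no transition model needed."""
    return q.argmax(axis=1)
```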

Summary

Problem    | Bellman Equation                                          | Algorithm
Prediction | Bellman Expectation Equation                              | Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                               | Value Iteration

Proofs

  • How do we know that value iteration converges to $v_*$?
  • Or that iterative policy evaluation converges to $v_\pi$?
  • And therefore that policy iteration converges to $v_*$ and $\pi_*$?
  • Is the solution unique?
  • How fast do these algorithms converge?

These questions are resolved by the contraction mapping theorem.
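
In outline, the argument is that the Bellman backup operators are $\gamma$-contractions in the max norm, so the contraction mapping (Banach fixed point) theorem gives a unique fixed point and geometric convergence to it. For the Bellman expectation operator, for example:

```latex
% Bellman expectation operator and its contraction property (sketch)
\begin{align*}
  (T^{\pi} v)(s) &= \sum_{a} \pi(a \mid s)\Big( \mathcal{R}_s^a
                     + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a}\, v(s') \Big) \\
  \lVert T^{\pi} u - T^{\pi} v \rVert_{\infty} &\le \gamma\, \lVert u - v \rVert_{\infty}
\end{align*}
% Hence T^pi has a unique fixed point v_pi and iterative policy evaluation
% converges to it geometrically; the same argument applied to the Bellman
% optimality operator covers value iteration and v_*.
```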