
Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.

Resources

Definitions

A state $S_t$ is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$, i.e. "the future is independent of the past given the present".

Full observability: the agent directly observes the environment state, $O_t = S_t$. Formally, this setting is a Markov Decision Process.

The State Transition Probability from state $s$ to successor state $s'$ is defined by $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.

The State Transition Matrix $\mathcal{P}$ defines all transition probabilities from states $s$ (rows) to successor states $s'$ (columns):

$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$, where each row of the matrix sums to 1.

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ where $\mathcal{S}$ is a finite set of states and $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.

An episode in the chain is a specific sequence of states, sampled by repeatedly following the transition probabilities until a terminal state is reached, e.g. a single run through the student Markov chain below from its start state to its terminal state.
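As a rough sketch (Python, with a made-up 3-state chain rather than the student chain from the diagram), sampling an episode just means drawing each successor state from the current state's row of $\mathcal{P}$ until a terminal state is hit:

```python
import numpy as np

# Illustrative 3-state chain (hypothetical numbers, not the student chain); rows of P sum to 1.
states = ["A", "B", "Sleep"]          # "Sleep" plays the role of the terminal state
P = np.array([
    [0.5, 0.4, 0.1],                  # P[i, j] = probability of moving from state i to state j
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],                  # the terminal state loops onto itself
])

def sample_episode(P, states, start=0, terminal=2, seed=0):
    """Sample one episode (sequence of states) from a Markov chain."""
    rng = np.random.default_rng(seed)
    episode, s = [states[start]], start
    while s != terminal:
        s = rng.choice(len(states), p=P[s])   # draw the successor from row s of P
        episode.append(states[s])
    return episode

print(sample_episode(P, states))      # e.g. ['A', 'B', 'B', 'Sleep']
```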

Markov Reward Processes

Basic Definitions

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

The Return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the total discounted reward from time step $t$, which values immediate reward above delayed reward. Discounting avoids infinite returns in cyclic processes and models uncertainty about the future; most real-world processes also weight immediate reward more heavily.

The value function $v(s) = \mathbb{E}[G_t \mid S_t = s]$ is the expected return starting from state $s$.
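As a small illustration (Python; the reward sequence and $\gamma$ are arbitrary), the return of a finite episode is just the discount-weighted sum of its rewards, and $v(s)$ can be estimated by averaging such returns over many episodes starting in $s$:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))   # 1 + 0 + 0.81 * 5 = 5.05

# A Monte Carlo estimate of v(s) = E[G_t | S_t = s]: average discounted_return(...)
# over many episodes sampled from state s.
```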

The Bellman Equation

The Bellman equation decomposes the value function into the immediate reward plus the discounted value of the successor state, which gives a way to calculate it:

$v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$

Or in matrix form:

$v = \mathcal{R} + \gamma \mathcal{P} v$, with direct solution $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$

  • Computationally expensive - the direct solution is $O(n^3)$ in the number of states, so for large MRPs and MDPs we resort to dynamic programming and other iterative methods (see the sketch below)
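A minimal sketch in Python (NumPy), assuming a small hypothetical MRP with made-up $\mathcal{P}$, $\mathcal{R}$ and $\gamma$, comparing the direct matrix solution with simple iterative backups:

```python
import numpy as np

# Illustrative 3-state MRP (hypothetical numbers, not taken from the page).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])        # transition matrix, rows sum to 1
R = np.array([-1.0, -2.0, 0.0])        # expected immediate reward per state
gamma = 0.9

# Direct solution of the Bellman equation: v = (I - gamma * P)^{-1} R  (O(n^3)).
v_direct = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative (dynamic-programming style) alternative: repeatedly apply v <- R + gamma * P v.
v_iter = np.zeros_like(R)
for _ in range(500):
    v_iter = R + gamma * P @ v_iter

print(v_direct)   # the two estimates agree to numerical precision
print(v_iter)
```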

Policies and Value Functions

  • A policy $\pi$ is a distribution over actions given states, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
  • Policies are time independent, or stationary (the same distribution applies regardless of the time step)

The value functions can be considered in two separate ways:

  • Value Function - how good it is to be in a given state, i.e. the expected return from that state
    • State Value Function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ - the expected return starting from state $s$ and then following policy $\pi$
    • Action Value Function $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ - the expected return starting from state $s$, taking action $a$, and then following policy $\pi$

Bellman's equation holds for both state and action value functions. Additionally, there are various forms of the Bellman Expectation Equation relating them, e.g.

$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_\pi(s') \Big)$ and $q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$
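For example, iterative policy evaluation applies the first form above as an update rule. The sketch below uses a hypothetical 2-state, 2-action MDP (all numbers are illustrative, not taken from the slides):

```python
import numpy as np

# Hypothetical MDP: P[a] is the transition matrix of action a, R[a][s] the expected
# reward for taking a in state s, and pi[s, a] the policy's probability of a in s.
P = {0: np.array([[0.9, 0.1], [0.0, 1.0]]),
     1: np.array([[0.2, 0.8], [0.5, 0.5]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}
pi = np.array([[0.5, 0.5],     # in state 0, each action with probability 0.5
               [0.5, 0.5]])    # likewise in state 1
gamma, n_states = 0.9, 2

# Bellman expectation backup: v(s) = sum_a pi(a|s) [ R^a_s + gamma * sum_s' P^a_ss' v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    v = sum(pi[:, a] * (R[a] + gamma * P[a] @ v) for a in (0, 1))

print(v)   # converges to v_pi
```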

See the slides for more examples.

Getting a Handle

  • Any specific action $a$ has a probability transition matrix of its own, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ - this means an action can probabilistically get you to more than one other state. See the result of the 'Pub' action below, and the sketch after this item.
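A tiny sketch of that idea (the successor states and probabilities are illustrative, not read off the diagram): one row of an action's matrix $\mathcal{P}^a$ is a full distribution over successor states, so the same action can land somewhere different on each try.

```python
import numpy as np

# Hypothetical row of a 'Pub'-like action's transition matrix P^a: from the current
# state, the single action can end up in any of three successor states.
successors = ["Class 1", "Class 2", "Class 3"]
p_pub = np.array([0.2, 0.4, 0.4])              # one row of P^a, sums to 1

rng = np.random.default_rng(0)
print(rng.choice(successors, p=p_pub))         # one action, a random outcome
```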
