
Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.

Resources

Definitions

A state $S_t$ is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$, i.e. "the future is independent of the past given the present".

Full observability: the agent directly observes the environment state, $O_t = S_t$. Formally, this setting is a Markov Decision Process.

The State Transition Probability from state $s$ to successor state $s'$ is defined by $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.

The State Transition Matrix $\mathcal{P}$ defines all transition probabilities from states $s$ (rows) to successor states $s'$ (columns):

$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$, where each row of the matrix sums to 1.

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ where $\mathcal{S}$ is a finite set of states and $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.

An episode in the chain is a specific sequence of states, sampled by repeatedly following the transition probabilities until a terminal state is reached, e.g. a single run through the student Markov chain below from its start state to its terminal state.
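As a rough sketch (Python, with a made-up 3-state chain rather than the student chain from the diagram), sampling an episode just means drawing each successor state from the current state's row of $\mathcal{P}$ until a terminal state is hit:

```python
import numpy as np

# Illustrative 3-state chain (hypothetical numbers, not the student chain); rows of P sum to 1.
states = ["A", "B", "Sleep"]          # "Sleep" plays the role of the terminal state
P = np.array([
    [0.5, 0.4, 0.1],                  # P[i, j] = probability of moving from state i to state j
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],                  # the terminal state loops onto itself
])

def sample_episode(P, states, start=0, terminal=2, seed=0):
    """Sample one episode (sequence of states) from a Markov chain."""
    rng = np.random.default_rng(seed)
    episode, s = [states[start]], start
    while s != terminal:
        s = rng.choice(len(states), p=P[s])   # draw the successor from row s of P
        episode.append(states[s])
    return episode

print(sample_episode(P, states))      # e.g. ['A', 'B', 'B', 'Sleep']
```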

Markov Reward Processes

Basic Definitions

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

The Return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the total discounted reward from time step $t$, which values immediate reward above delayed reward. Discounting avoids infinite returns in cyclic processes and models uncertainty about the future; most real-world processes also weight immediate reward more heavily.

The value function $v(s) = \mathbb{E}[G_t \mid S_t = s]$ is the expected return starting from state $s$.
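As a small illustration (Python; the reward sequence and $\gamma$ are arbitrary), the return of a finite episode is just the discount-weighted sum of its rewards, and $v(s)$ can be estimated by averaging such returns over many episodes starting in $s$:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))   # 1 + 0 + 0.81 * 5 = 5.05

# A Monte Carlo estimate of v(s) = E[G_t | S_t = s]: average discounted_return(...)
# over many episodes sampled from state s.
```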

The Bellman Equation

The Bellman equation decomposes the value function into the immediate reward plus the discounted value of the successor state, which gives a way to calculate it:

$v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$

Or in matrix form:

$v = \mathcal{R} + \gamma \mathcal{P} v$, with direct solution $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$

  • Computationally expensive - the direct solution is $O(n^3)$ in the number of states, so for large MRPs and MDPs we resort to dynamic programming and other iterative methods (see the sketch below)
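A minimal sketch in Python (NumPy), assuming a small hypothetical MRP with made-up $\mathcal{P}$, $\mathcal{R}$ and $\gamma$, comparing the direct matrix solution with simple iterative backups:

```python
import numpy as np

# Illustrative 3-state MRP (hypothetical numbers, not taken from the page).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])        # transition matrix, rows sum to 1
R = np.array([-1.0, -2.0, 0.0])        # expected immediate reward per state
gamma = 0.9

# Direct solution of the Bellman equation: v = (I - gamma * P)^{-1} R  (O(n^3)).
v_direct = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative (dynamic-programming style) alternative: repeatedly apply v <- R + gamma * P v.
v_iter = np.zeros_like(R)
for _ in range(500):
    v_iter = R + gamma * P @ v_iter

print(v_direct)   # the two estimates agree to numerical precision
print(v_iter)
```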

Policies and Value Functions

  • A policy $\pi$ is a distribution over actions given states, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
  • Policies are time independent, or stationary (the same distribution applies regardless of the time step)

The value functions can be considered in two separate ways:

  • Value Function - how good it is to be in a given state, i.e. the expected return from that state
    • State Value Function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ - the expected return starting from state $s$ and then following policy $\pi$
    • Action Value Function $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ - the expected return starting from state $s$, taking action $a$, and then following policy $\pi$

Bellman's equation holds for both state and action value functions. Additionally, there are various forms of the Bellman Expectation Equation relating them, e.g.

$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_\pi(s') \Big)$ and $q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$
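For example, iterative policy evaluation applies the first form above as an update rule. The sketch below uses a hypothetical 2-state, 2-action MDP (all numbers are illustrative, not taken from the slides):

```python
import numpy as np

# Hypothetical MDP: P[a] is the transition matrix of action a, R[a][s] the expected
# reward for taking a in state s, and pi[s, a] the policy's probability of a in s.
P = {0: np.array([[0.9, 0.1], [0.0, 1.0]]),
     1: np.array([[0.2, 0.8], [0.5, 0.5]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}
pi = np.array([[0.5, 0.5],     # in state 0, each action with probability 0.5
               [0.5, 0.5]])    # likewise in state 1
gamma, n_states = 0.9, 2

# Bellman expectation backup: v(s) = sum_a pi(a|s) [ R^a_s + gamma * sum_s' P^a_ss' v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    v = sum(pi[:, a] * (R[a] + gamma * P[a] @ v) for a in (0, 1))

print(v)   # converges to v_pi
```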

See the slides for more examples.

Getting a Handle

  • Any specific action $a$ has a probability transition matrix of its own, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ - this means an action can probabilistically get you to more than one other state. See the result of the 'Pub' action below, and the sketch after this item.
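A tiny sketch of that idea (the successor states and probabilities are illustrative, not read off the diagram): one row of an action's matrix $\mathcal{P}^a$ is a full distribution over successor states, so the same action can land somewhere different on each try.

```python
import numpy as np

# Hypothetical row of a 'Pub'-like action's transition matrix P^a: from the current
# state, the single action can end up in any of three successor states.
successors = ["Class 1", "Class 2", "Class 3"]
p_pub = np.array([0.2, 0.4, 0.4])              # one row of P^a, sums to 1

rng = np.random.default_rng(0)
print(rng.choice(successors, p=p_pub))         # one action, a random outcome
```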
