Excerpt: Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.
Resources
Prediction
These methods estimate the value function of an unknown MDP.
Monte Carlo
- Learns directly from complete episodes of experience. Caveat: all episodes must terminate
- Uses the concept 'value = empirical mean return'
- Goal is to learn $v_{\pi}$ from episodes of experience under policy $\pi$
- Challenges:
  - Make sure we visit all states over our episodes enough times for the mean to approach its limit
Method
- Gather a set of episodes and run over each of them
- Each time a state is visited:
  - Increment a counter: $N(s) \leftarrow N(s) + 1$
  - Increment the total return for the state: $S(s) \leftarrow S(s) + G_t$
- Value is estimated by the mean return: $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$

The increments can be done on first visits only, or on every visit, and the limiting results will still hold. Why one or the other?

Typically the value function is estimated and updated on the fly with incremental mean calculations (often faster to implement):

$$
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left( G_t - V(S_t) \right)
$$

or over a moving window (slowly forget old episode contributions):

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)
$$
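A minimal sketch of the method above as every-visit Monte-Carlo prediction with the incremental-mean update; the `(state, reward)` episode layout and the `gamma` parameter are assumptions for illustration, not part of the notes.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Every-visit Monte-Carlo prediction using incremental means.

    `episodes` is assumed to be a list of episodes, each a time-ordered list
    of (state, reward) pairs; this data layout is illustrative.
    """
    N = defaultdict(int)    # visit counts N(s)
    V = defaultdict(float)  # value estimates V(s)

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so the return G_t accumulates in one pass
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            # Incremental mean: V(s) <- V(s) + (1/N(s)) (G_t - V(s))
            V[state] += (G - V[state]) / N[state]
    return V
```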
Temporal Difference Learning
Fundamentally different from Monte Carlo in that incomplete episodes are permissible. For the remainder of the return, a guess is provided to bootstrap the calculations.
$$
\begin{align}
V(S_t) &\leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right) \\
&\leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)
\end{align}
$$
- TD Target - $R_{t+1} + \gamma V(S_{t+1})$ is the estimated return; this is considered a biased estimate, as opposed to the Monte Carlo method
  - There are specific cases where the bias causes misbehaviour
- True TD Target - $R_{t+1} + \gamma v_{\pi}(S_{t+1})$ is the estimated return; this is considered an unbiased estimate
- Exploits the Markov property of the system (i.e. uses some knowledge about the system to provide a biased estimate)
  - Lends to an increased efficiency
- TD(lambda) - allows you to choose how far ahead to use returns before using a biased estimate (moving between shallow and deep backups). See the sketch below for the basic TD(0) case.
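A minimal sketch of tabular TD(0) prediction implementing the update above; the transition format `(state, reward, next_state, done)` and the `alpha`/`gamma` defaults are illustrative assumptions.

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha (R_{t+1} + gamma V(S_{t+1}) - V(S_t)).

    `transitions` is assumed to be an iterable of
    (state, reward, next_state, done) tuples; this format is illustrative.
    """
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        # Bootstrap with the current estimate of the next state's value;
        # a terminal transition contributes only the observed reward.
        td_target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (td_target - V[state])
    return V
```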
Comparisons
Property | Dynamic Programming | Monte Carlo | TD |
---|---|---|---|
Episodes | No | Yes, terminating | Yes, may be incomplete |
Bootstrapping | Yes | No | Yes |
Sampling | No | Yes | Yes |
Initial Value | Not sensitive | Not sensitive | Sensitive |
Bias | No bias | No bias | Biased estimate |
Noise/Variance | None | Lots (many R's in the measured G_t) | Low (one R before the biased estimate) |
Batching | - | Converges to a best fit for the observations | Converges to a max likelihood given all the data |
Markov property | - | More effective in non-Markov environments | More efficient in Markov environments |
Control
How can we find the optimal value function and subsequently policy? The algorithms are mostly concerned with On/Off Policy Monte Carlo and Temporal-Difference learning.
- On Policy - 'learning while on the job'
  - Learn about policy $\pi$ from experience sampled from $\pi$
- Off Policy - 'look over someone else's shoulder', e.g. robot looking at a human
  - Learn about policy $\pi$ from experience sampled from $\mu$
On-Policy Monte-Carlo Iteration
Basic premise is to apply the same process that dynamic programming uses to greedily improve the policy.
But this has two problems:
- Value Function: extracting the policy from the value function requires knowledge of the transition dynamics $\mathcal{P}^a_{ss'}$
  - Solution: use Q instead of V, since it doesn't need to know the transitions

$$
\pi'(s) = \mathrm{argmax}_{a \in \mathcal{A}} \, \mathcal{R}^a_{s} + \mathcal{P}^a_{ss'}V(s') \hspace{1cm} \rightarrow \hspace{1cm} \pi'(s) = \mathrm{argmax}_{a \in \mathcal{A}} Q(s,a)
$$
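A tiny sketch of the point being made: greedy improvement over Q is a pure table lookup, with no reward model $\mathcal{R}$ or transition model $\mathcal{P}$ required. The dict-of-arrays layout for `Q` is an assumption.

```python
import numpy as np

def greedy_policy_from_q(Q):
    """pi'(s) = argmax_a Q(s, a), using action values alone.

    `Q` is assumed to map each state to a NumPy array of action values.
    """
    return {state: int(np.argmax(action_values)) for state, action_values in Q.items()}
```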
- Policies: no guarantee that the greedy policy improvements generate policies that can sufficiently explore the space
  - Solution: introduce some randomness to the greedy update (note m is the number of actions) ... epsilon-greedy!
    - This satisfies the policy improvement theorem, i.e. everything gets better
    - Hard to do better than this naive method
$$
\pi(a|s) = \begin{cases}
\epsilon/m + 1 - \epsilon & \mathrm{if} \; a = \mathrm{argmax}_{a' \in \mathcal{A}} Q(s,a') \\
\epsilon/m & \mathrm{otherwise}
\end{cases}
$$
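A minimal sketch of epsilon-greedy action selection matching the probabilities above; representing `q_values` as a NumPy array is an assumption.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Sample an action: probability epsilon/m + 1 - epsilon for the greedy action,
    epsilon/m for each of the other m - 1 actions."""
    rng = rng if rng is not None else np.random.default_rng()
    m = len(q_values)
    probs = np.full(m, epsilon / m)                  # epsilon/m mass on every action
    probs[int(np.argmax(q_values))] += 1 - epsilon   # extra mass on the greedy action
    return int(rng.choice(m, p=probs))
```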
- Improvements
  - Update the policy after every episode, speedup!
- Greedy in the Limit with Infinite Exploration (GLIE)
  - All state-action pairs are explored infinitely many times
  - The policy converges on a greedy policy
  - Guarantee GLIE (see the sketch below)
    - Evaluation: sample episodes with the current policy, counting visits $N(S_t, A_t)$ and updating $Q(S_t, A_t)$ towards the mean return
    - Improvement: make sure epsilon goes to zero as the update count goes up, e.g. $\epsilon_k = 1/k$ on episode $k$
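Putting the pieces together, a sketch of GLIE Monte-Carlo control with an $\epsilon_k = 1/k$ schedule. The `env` object with `reset() -> state` and `step(action) -> (next_state, reward, done)` is a hypothetical interface assumed for illustration.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, num_actions, gamma=1.0):
    """GLIE Monte-Carlo control sketch (epsilon-greedy policy, epsilon_k = 1/k)."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))  # action values Q(s, a)
    N = defaultdict(lambda: np.zeros(num_actions))  # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k  # GLIE: exploration decays towards zero
        # Generate one episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(num_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Evaluation: incremental-mean update of Q towards the observed returns
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```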
On-Policy TD (Sarsa)
- Convergence - guaranteed if the policies are GLIE and the step sizes satisfy the Robbins-Monro conditions: $\sum_i \alpha_i = \infty$ and $\sum_i \alpha_i^2 < \infty$
- Can consider n-step returns as we did for TD prediction; what is the best n to use? Sarsa(lambda)! Eligibility traces can also be utilised
  - The lambda approach can help you update more quickly over one episode.
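A sketch of the basic on-policy Sarsa(0) control loop (not the Sarsa(lambda)/eligibility-trace variant mentioned above); the hypothetical `env` interface and the fixed `epsilon`/`alpha` values are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy Sarsa(0):
    Q(S,A) <- Q(S,A) + alpha (R + gamma Q(S',A') - Q(S,A))."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))

    def behave(state):
        # epsilon-greedy; on-policy, so this is also the policy being evaluated
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = behave(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = behave(next_state)
            # Bootstrap with the action actually taken next (S, A, R, S', A')
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```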
Off-Policy Learning
Goal
- Evaluate target policy $\pi(a|s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$ while following behaviour policy $\mu(a|s)$
- Target Policy $\pi(a|s)$ - the policy we are trying to optimise
- Behaviour Policy $\mu(a|s)$ - the policy we actually follow to generate the experience we learn from
Why?
- Learn from watching humans
- Re-use experience generated from old policies - cache/batch old episodes
- Learn about optimal policy while following exploratory policy - make sure we explore the state space sufficiently
- Learn about multiple policies while following one policy
- Importance sampling has high variance (noise) for Monte-Carlo, but can be made to work for TD (see the sketch below).
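A sketch of off-policy TD(0) prediction with a single-step importance-sampling correction on the TD target, which is the usual way importance sampling "can be made to work for TD". The transition format and the `target_probs`/`behaviour_probs` callables are hypothetical names assumed for illustration.

```python
from collections import defaultdict

def off_policy_td0(transitions, target_probs, behaviour_probs, alpha=0.1, gamma=1.0):
    """Off-policy TD(0) with one importance-sampling ratio per step:
    V(S) <- V(S) + alpha (rho (R + gamma V(S')) - V(S)),  rho = pi(A|S) / mu(A|S).

    `transitions` is assumed to be (state, action, reward, next_state, done)
    tuples generated by the behaviour policy mu; `target_probs(s, a)` and
    `behaviour_probs(s, a)` are assumed to return pi(a|s) and mu(a|s).
    """
    V = defaultdict(float)
    for state, action, reward, next_state, done in transitions:
        rho = target_probs(state, action) / behaviour_probs(state, action)
        td_target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (rho * td_target - V[state])
    return V
```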
Off-Policy Q-Learning
Take the next step using the observed (behaviour) policy, but then bootstrap the remainder of the return from the target (greedy/optimal) policy.
- Is the reward the reward gained from $A_{t+1}$?
- Use an epsilon-greedy policy for the behavioural policy (exploration) and a greedy policy for the target policy (optimal)
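A sketch of the corresponding tabular Q-learning loop, where actions are chosen epsilon-greedily but the bootstrap uses the greedy (max) action value; the same hypothetical `env` interface as above is assumed.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy Q-learning:
    Q(S,A) <- Q(S,A) + alpha (R + gamma max_a Q(S',a) - Q(S,A))."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy in Q (keeps exploring)
            if rng.random() < epsilon:
                action = int(rng.integers(num_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Target policy: greedy in Q, so bootstrap with max_a Q(S', a)
            target = reward + (0.0 if done else gamma * float(np.max(Q[next_state])))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```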