
Excerpt

Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.


Resources

Prediction

These methods estimate the value function of an unknown MDP.

Monte Carlo

  • Learns directly from complete episodes of experience. Caveat: all episodes must terminate
  • Uses the concept 'value = empirical mean return'
  • Goal is to learn $v_{\pi}$ from episodes of experience under policy $\pi$
  • Challenges:
    • Make sure we visit all states enough times across our episodes for the mean return to approach its limit

Method

  • Gather a set of episodes and run over each of them
    • Each time a state is visited:
      • Increment a counter: $N(s) \leftarrow N(s) + 1$
      • Increment the total return for the state: $S(s) \leftarrow S(s) + G_t$
  • Value is estimated by the mean return: $V(s) = S(s) / N(s)$
  • By the law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$

The increments can be done on first visits only, or on every visit and the limiting results will still hold. Why one or the other?

Typically the value function is estimated and updated on the fly with incremental mean calculations:

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t))

or over a moving window (slowly forget old episode contributions):

V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t))
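
As a rough illustration of the prediction loop above, here is a minimal every-visit Monte-Carlo sketch in Python. The episode format (a list of (state, reward) pairs, where the reward is $R_{t+1}$) and the function name are assumptions made for the example, not part of the original notes; the first-visit variant would simply restrict the update to a state's first occurrence in each episode.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Every-visit Monte-Carlo prediction with incremental mean updates.

    `episodes` is an iterable of episodes gathered under the policy being
    evaluated; each episode is a list of (S_t, R_{t+1}) pairs.
    """
    N = defaultdict(int)    # visit counts N(s)
    V = defaultdict(float)  # value estimates V(s)

    for episode in episodes:
        # Sweep backwards so the return G_t is available at every step.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            # Incremental mean: V(s) <- V(s) + 1/N(s) * (G_t - V(s)).
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
    return V
```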


Temporal Difference Learning

Fundamentally different from Monte Carlo in that incomplete episodes are permissible. For the remainder of the return, a guess is provided to bootstrap the calculations.

\begin{align}
V(S_t) &\leftarrow V(S_t) + \alpha ( G_t - V(S_t)) \\
&\leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \\
\end{align}
  • TD Target - the estimated return $R_{t+1} + \gamma V(S_{t+1})$; this is considered a biased estimate, as opposed to the Monte Carlo return (see the TD(0) sketch after this list)
    • There are specific cases where the bias causes misbehaviour
  • True TD Target - the estimated return $R_{t+1} + \gamma v_{\pi}(S_{t+1})$; this is considered an unbiased estimate
  • Exploits the Markov property of the system (i.e. uses some knowledge about the system to provide a biased estimate)
    • Lends to an increased efficiency
  • TD(lambda) - allows you to choose how far ahead to use actual returns before substituting the biased estimate (moving between shallow and deep backups)
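
A minimal sketch of tabular TD(0) prediction using the update above. The gym-style environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` callable are assumptions made for the example.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # The TD target bootstraps from the current estimate of the next state.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```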

Comparisons

Property        | Dynamic Programming | Monte Carlo                                  | TD
Episodes        | No                  | Yes, terminating                             | Yes, may be incomplete
Bootstrapping   | Yes                 | No                                           | Yes
Sampling        | No                  | Yes                                          | Yes
Initial Value   | Not sensitive       | Not sensitive                                | Sensitive
Bias            | No bias             | No bias                                      | Biased estimate
Noise/Variance  | None                | Lots (many R's in the measured G_t)          | Low (one R before the biased estimate)
Batching        | -                   | Converges to a best fit for the observations | Converges to a max likelihood estimate given all the data
Markov property | -                   | More effective in non-Markov environments    | More efficient in Markov environments



Control

How can we find the optimal value function and, subsequently, the optimal policy? The algorithms are mostly concerned with On/Off Policy Monte Carlo and Temporal-Difference learning.

  • On Policy - 'learning while on the job'
    • Learn about policy $\pi$ from experience sampled from $\pi$
  • Off Policy - 'look over someone else's shoulder', e.g. a robot learning by watching a human
    • Learn about policy $\pi$ from experience sampled from $\mu$


On-Policy Monte-Carlo Iteration

Basic premise is to apply the same process that dynamic programming uses to greedily improve the policy.


But this has two problems:

  • Value Function: extracting the policy from the value function requires knowledge of the transition dynamics, $\mathcal{P}^a_{ss'}$
    • Solution: use Q instead of V, since it doesn't need to know the transitions

\pi'(s) = \mathrm{argmax}_{a \in \mathcal{A}} \left( \mathcal{R}^a_{s} + \mathcal{P}^a_{ss'}V(s') \right) \hspace{1cm} \rightarrow \hspace{1cm} \pi'(s) = \mathrm{argmax}_{a \in \mathcal{A}} Q(s,a)


  • Policies: no guarantee that the greedy policy improvements generate policies that can sufficiently explore the space
    • Solution: introduce some randomness to the greedy update (note m is the number of actions to be tried) ... epsilon-greedy!
      • This satisfies the policy improvement theorem, i.e. everything gets better
      • Hard to do better than this naive method

\pi(a|s) = \begin{cases}
\epsilon/m + 1 - \epsilon & \mathrm{if} \; a = a^* = \mathrm{argmax}_{a' \in \mathcal{A}} Q(s,a') \\
\epsilon/m & \mathrm{otherwise} \\
\end{cases}
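
A short Python sketch of sampling an action with exactly these probabilities; the representation of Q as a dict keyed by (state, action) is an assumption made for the example.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Greedy action with probability 1 - epsilon + epsilon/m, any other action
    with probability epsilon/m, where m = len(actions)."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniform, epsilon/m per action
    return max(actions, key=lambda a: Q[(state, a)])  # exploit the greedy action
```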


  • Improvements
    • Update the policy after every episode, speedup!


  • Greedy in the Limit with Infinite Exploration (GLIE)
    • All state-action pairs are explored infinitely many times
    • The policy converges on a greedy policy
  • Guaranteeing GLIE (a sketch follows this list)
    • Evaluation: Monte-Carlo policy evaluation of Q(s,a) with visit counts, updating towards the sampled returns
    • Improvement: make sure epsilon goes to zero as the update count goes up, e.g. $\epsilon_k = 1/k$ for episode k
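
A rough sketch of GLIE Monte-Carlo control under the $\epsilon_k = 1/k$ schedule. The `sample_episode` helper (which acts epsilon-greedily with respect to the current Q and returns a list of (state, action, reward) triples) is an assumption made for the example, not part of the original notes.

```python
from collections import defaultdict

def glie_mc_control(sample_episode, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control: evaluate Q with visit counts, improve with eps_k = 1/k."""
    Q = defaultdict(float)  # action-value estimates Q(s, a)
    N = defaultdict(int)    # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                     # GLIE schedule: epsilon -> 0
        episode = sample_episode(Q, epsilon)  # [(S_t, A_t, R_{t+1}), ...]

        # Incremental-mean evaluation over the visited (state, action) pairs.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```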

On-Policy TD (Sarsa)


  • Convergence - guaranteed if policies are GLIE and the step sizes satisfy the Robbins-Monro conditions for a sequence
    • Robbins-Monro: $\sum_i \alpha_i = \infty$ and $\sum_i \alpha_i^2 < \infty$ (e.g. $\alpha_i = 1/i$ satisfies both)
  • Can consider n-step returns as we did for TD prediction; what is the best n to use? Sarsa(lambda)! Can also utilise eligibility traces (a one-step sketch follows this list)
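
A minimal sketch of the one-step Sarsa update (Sarsa(lambda) and eligibility traces are not shown). The gym-style environment interface and the inline epsilon-greedy behaviour are assumptions made for the example.

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One-step Sarsa: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)),
    where A' is the action actually taken in S' (on-policy)."""
    Q = defaultdict(float)

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)
            # Bootstrap from the action actually selected by the same policy.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```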


The lambda approach can help you update more quickly over one episode: with eligibility traces, a single observed reward propagates back to the recently visited state-action pairs rather than only the most recent one.


Off-Policy Learning

Goal

  • Evaluate target policy $\pi(a|s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$ while following behaviour policy $\mu(a|s)$

Why?

  • Learn from watching humans
  • Re-use experience generated from old policies - cache/batch old episodes
  • Learn about optimal policy while following exploratory policy - make sure we explore the state space sufficiently
  • Learn about multiple policies while following one policy


  • Importance Sampling - has high variance (noise) for Monte-Carlo, but can be made to work for TD (see the example below)
  • Target Policy $\pi$ - the policy we are trying to optimise
  • Behaviour Policy $\mu$ - the policy we actually follow while generating the experience
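
As a concrete example of the TD case, a single-step importance-sampling correction weights the TD target by the ratio of the two policies; this form is written out here from the definitions above rather than taken verbatim from the original notes:

V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)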


Off-Policy Q-Learning

Take the next step from the observed policy, but then consider bootstrapping the rest from the optimal policy

  • Next action is chosen using the behaviour policy: $A_{t+1} \sim \mu(\cdot|S_t)$
  • Alternative successor action is chosen using the target policy: $A' \sim \pi(\cdot|S_t)$
  • Update towards value of the alternative action:
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t))
  • Is the reward the reward gained from taking $A_{t+1}$?
  • Use an epsilon-greedy policy for the behaviour policy (exploration) and a greedy policy for the target policy (optimal); a sketch of the resulting Q-learning loop follows
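
A minimal sketch of that Q-learning loop: the behaviour policy is epsilon-greedy with respect to Q, while the update bootstraps from the greedy (target) policy via the max over successor actions. The gym-style environment interface is an assumption made for the example.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Q-learning: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy exploration.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy w.r.t. Q, i.e. bootstrap from the best successor action.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```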
