
Excerpt

Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.

...

Prediction

These methods estimate the value function of an unknown MDP.

Monte Carlo

  • Learns directly from complete episodes of experience. Caveat: all episodes must terminate
  • Uses the concept 'value = empirical mean return'
  • Goal is to learn $v_{\pi}$ from episodes of experience under policy $\pi$
  • Challenges:
    • Make sure all states are visited often enough across the episodes for their empirical mean returns to converge

Method

  • Gather a set of episodes and run over each of them
    • Each time a state is visited:
      • Increment a counter $N(s) \leftarrow N(s) + 1$
      • Increment the total return for the state $S(s) \leftarrow S(s) + G_t$
  • Value is estimated by the mean return $V(s) = S(s) / N(s)$ (see the sketch after this list)
  • By the law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$
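A minimal sketch of this counting method, assuming each episode is already available as a list of (state, reward) pairs where the reward is the one received on leaving that state; the function name and data layout are illustrative, not from the source.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the empirical mean return over a batch of episodes."""
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # total returns S(s)

    for episode in episodes:
        # Compute the return G_t for each time step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # back to chronological order

        counted = set()
        for state, G in returns:
            if first_visit and state in counted:
                continue  # first-visit MC: count only the first occurrence per episode
            counted.add(state)
            N[state] += 1   # N(s) <- N(s) + 1
            S[state] += G   # S(s) <- S(s) + G_t

    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```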

The increments can be done on the first visit to a state within an episode only, or on every visit, and the limiting result holds in both cases. Why choose one or the other?

Typically the value function is estimated and updated on the fly with incremental mean calculations:


V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left( G_t - V(S_t) \right)

or with a constant step size $\alpha$, which behaves like a moving window by slowly forgetting old episode contributions:

V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)
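A sketch of the incremental (every-visit) form of the same estimate, reusing the episode layout assumed in the previous sketch; `V` and `N` are plain dicts, `alpha=None` selects the exact running mean, and a constant `alpha` gives the forgetting variant.

```python
def mc_incremental_update(V, N, episode, gamma=1.0, alpha=None):
    """Update the value table V in place after one (terminated) episode.

    With alpha=None the exact running mean 1/N(S_t) is used; a constant
    alpha instead forgets old episode contributions exponentially.
    """
    # Compute the return G_t for each step by scanning the episode backwards.
    G = 0.0
    targets = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        targets.append((state, G))

    for state, G in reversed(targets):  # chronological order
        N[state] = N.get(state, 0) + 1
        step = alpha if alpha is not None else 1.0 / N[state]
        v = V.get(state, 0.0)
        V[state] = v + step * (G - v)   # V(S_t) <- V(S_t) + step * (G_t - V(S_t))
```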

...

  • Greedy in the Limit with Infinite Exploration (GLIE)
    • All state-action pairs are explored infinitely many times
    • The policy converges on a greedy policy
  • Guaranteeing GLIE (GLIE Monte-Carlo control):
    • Evaluation: Monte-Carlo evaluation of $Q(S_t, A_t)$, updated with the incremental mean of sampled returns using the visit count $N(S_t, A_t)$
    • Improvement: make sure epsilon goes to zero as the episode count goes up, e.g. $\epsilon_k = 1/k$
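A sketch of GLIE Monte-Carlo control built from these two conditions; the environment interface (a `reset()` returning a state and a `step(action)` returning `(next_state, reward, done, info)`), the discrete action count, and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict

def glie_mc_control(env, n_actions, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch: epsilon-greedy with epsilon_k = 1/k,
    and Q updated with the incremental mean using 1/N(s, a) step sizes."""
    Q = defaultdict(float)  # Q(s, a), keyed by (state, action)
    N = defaultdict(int)    # visit counts N(s, a)

    def epsilon_greedy(state, epsilon):
        if random.random() < epsilon:
            return random.randrange(n_actions)                          # explore
        return max(range(n_actions), key=lambda a: Q[(state, a)])       # exploit

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k  # GLIE schedule: exploration decays to zero
        episode = []
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit incremental-mean update of Q over the finished episode.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

    return Q
```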

On-Policy TD (Sarsa)

  • Convergence: guaranteed if the policies are GLIE and the step sizes satisfy the Robbins-Monro conditions for a sequence
    • Robbins-Monro: $\sum_i \alpha_i = \infty$ and $\sum_i \alpha_i^2 < \infty$
  • Can consider n-step returns as we did for TD prediction; what is the best n to use? Sarsa($\lambda$)! It can also utilise eligibility traces (a one-step Sarsa update is sketched below)


The lambda approach can help you update more quickly over one episode. See below.
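For reference, a minimal one-step Sarsa update on the same dict-based Q table as the earlier sketches; Sarsa($\lambda$) would instead spread this TD error over all recently visited state-action pairs via eligibility traces. The function name is illustrative.

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma=1.0):
    """One-step on-policy Sarsa update on a dict-based Q table.

    next_action is the action the (epsilon-greedy) behaviour policy actually
    takes in next_state; using it in the target is what makes Sarsa on-policy.
    """
    q = Q.get((state, action), 0.0)
    td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
    Q[(state, action)] = q + alpha * (td_target - q)
```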

Off-Policy Learning

Goal

  • Evaluate target policy $\pi(a|s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$ while following behaviour policy $\mu(a|s)$

...

  • Importance Sampling - has very high variance (noise) for Monte-Carlo, where the correction ratio is a product over the whole episode, but can be made to work for TD, which only needs a single step's ratio (sketched after this list)
  • Target Policy $\pi$ - the policy we are trying to evaluate and optimise
  • Behaviour Policy $\mu$ - the policy we actually follow to generate the experience
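A sketch of a one-step importance-sampled TD(0) update implied by the bullets above, assuming a dict-based value table and that the probabilities of the taken action under both policies are available; the names are illustrative.

```python
def off_policy_td_update(V, state, reward, next_state, pi_prob, mu_prob,
                         alpha, gamma=1.0):
    """TD(0) update with a one-step importance-sampling correction.

    pi_prob and mu_prob are the probabilities of the action actually taken
    in `state` under the target policy pi and the behaviour policy mu; only
    this single-step ratio is needed, unlike the whole-episode product that
    makes Monte-Carlo importance sampling so noisy.
    """
    ratio = pi_prob / mu_prob
    v = V.get(state, 0.0)
    corrected_target = ratio * (reward + gamma * V.get(next_state, 0.0))
    V[state] = v + alpha * (corrected_target - v)
```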

Off-Policy Q-Learning

Take the next step using the observed (behaviour) policy, but bootstrap the rest of the return from the greedy target policy

...

  • Is the reward the one gained from $A_{t+1}$? No - $R_{t+1}$ is the reward from the action actually taken, $A_t$, under the behaviour policy; the target policy's alternative action only enters through the bootstrapped term $Q(S_{t+1}, A')$
  • Use an epsilon-greedy policy for the behaviour policy (exploration) and a greedy policy for the target policy (optimal)
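A minimal sketch of the resulting Q-learning scheme: an epsilon-greedy behaviour policy for action selection and a greedy maximisation in the bootstrap. The dict-based Q table and function names are assumptions carried over from the earlier sketches.

```python
import random

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    """Behaviour policy: epsilon-greedy with respect to the current Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                               # explore
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))   # exploit

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma, n_actions):
    """Off-policy Q-learning: the step is taken with the behaviour policy's
    action, but the bootstrap maximises over the next state's actions,
    i.e. it follows the greedy target policy."""
    best_next = max(Q.get((next_state, a), 0.0) for a in range(n_actions))
    q = Q.get((state, action), 0.0)
    Q[(state, action)] = q + alpha * (reward + gamma * best_next - q)
```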
