Excerpt
Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.

David Silver's Course Slides: '15
- 01 - Introduction (slides, video)
- 02 - Markov Decision Processes (slides, video)
- 03 - Dynamic Programming (slides, video)
- 04 - Model Free Prediction (slides, video)
- 05 - Model Free Control (slides, video)

Prediction

These methods estimate the value function of an unknown MDP.

Monte Carlo

Learns directly from complete episodes of experience. Caveat: all episodes must terminate
Uses the concept 'value = empirical mean return'
Goal is to learn
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body v_{\pi}
from episodes of experience under policy
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body \pi
Challenges:
- Make sure we visit all states over our episodes enough times sufficient for a mean to be approaching a limit

Method

Gather a set of episodes and run over each of them
- Each time a state is visited:
  - Increment a counter
    Mathinline
    host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
    body N(s) \leftarrow N(s) + 1
  - Increment the total return for the state
    Mathinline
    host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
    body S(s) \leftarrow S(s) + G_t
Value is estimated by the mean return
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body V(s) = S(s) / N(s)
- Often faster to implement with incremental means
By the law of large numbers
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body V(s) \rightarrow v_{\pi}(s)
as
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body N(s) \rightarrow \infty

The increments can be done on first visits only, or on every visit and the limiting results will still hold. Why one or the other?

Typically the value function is estimated and updated on the fly with incremental mean calculations:

Mathblock

host	5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t))

or over a moving window (slowly forget old episode contributions):

Mathblock

host	5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4

V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t))

...

Greedy in the Limit with Infinite Exploration (GLIE)
- All state-action pairs are explored infinitely many times
- The policy converges ona greedy policy
Guarantee GLIE
- Evaluation: Something to do with sampling and counting?
- Improvement: Make sure epsilon goes to zero as the update count goes up

On-Policy TD (Sarsa)

Convergence - guaranteed if policies are GLIE and step sizes satisfy the Robbins-Munro properties for a sequence
- Robbins-Munro -
  Mathinline
  host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
  body \sum_i \alpha_i = \infty
  ,
  Mathinline
  host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
  body \sum_i \alpha_i^2 < \infty
Can consider n-step returns as we did for TD prediction, what is the best n to use? Sarsa(lambda)! Also can utilise eligibility traces

The lambda approach can help you update more quickly over one episode. See below.

Off-Policy Learning

Goal

Evaluate target policy
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body \pi(a|s)
to compute
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body v_{\pi}(s)
or
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body q_{\pi}(s,a)
while following behaviour policy
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body \mu(a|s)

...

Learn from watching humans
Re-use experience generated from old policies - cache/batch old episodes
Learn about optimal policy while following exploratory policy - make sure we explore the state space sufficiently
Learn about multiple policies while following one policy

Importance sampling has high variance (noise) for Monte-Carlo, but can be made to work for TD.

Off-Policy Q-Learning

Take the next step from the observed policy, but then consider bootstrapping the rest from the optimal policy

Next action:
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body A_{t+1} \sim \mu(\cdot|S_t)
Alternative successor action:
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body A' \sim \pi(\cdot|S_t)
Update towards value of the alternative action:

Mathblock

host	5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t))

Is the reward the reward gained from
Mathinline
host 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4
body A_{t+1}
?

Versions Compared

Old Version 30

New Version 31

Key

Prediction

Monte Carlo

Method

On-Policy TD (Sarsa)

Off-Policy Learning

Off-Policy Q-Learning

Page Comparison

Versions Compared

Old Version 30

New Version 31

Key

Prediction

Monte Carlo

Method

On-Policy TD (Sarsa)

Off-Policy Learning

Off-Policy Q-Learning