Excerpt |
---|
Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable. |
...
Prediction
These methods estimate the value function of an unknown MDP.
Monte Carlo
- Learns directly from complete episodes of experience. Caveat: all episodes must terminate
- Uses the concept 'value = empirical mean return'
- Goal is to learn
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | v_{\pi} |
---|
|
from episodes of experience under policy Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | \pi |
---|
|
- Challenges:
- Make sure we visit all states over our episodes enough times sufficient for a mean to be approaching a limit
Method
- Gather a set of episodes and run over each of them
- Each time a state is visited:
- Increment a counter
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | N(s) \leftarrow N(s) + 1 |
---|
|
- Increment the total return for the state
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | S(s) \leftarrow S(s) + G_t |
---|
|
- Value is estimated by the mean return
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | V(s) = S(s) / N(s) |
---|
|
- By the law of large numbers
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | V(s) \rightarrow v_{\pi}(s) |
---|
|
as Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | N(s) \rightarrow \infty |
---|
|
The increments can be done on first visits only, or on every visit and the limiting results will still hold. Why one or the other?
Typically the value function is estimated and updated on the fly with incremental mean calculations:
Mathblock |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
|
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t)) |
or over a moving window (slowly forget old episode contributions):
Mathblock |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
|
V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t)) |
...
- Greedy in the Limit with Infinite Exploration (GLIE)
- All state-action pairs are explored infinitely many times
- The policy converges ona greedy policy
- Guarantee GLIE
- Evaluation: Something to do with sampling and counting?
- Improvement: Make sure epsilon goes to zero as the update count goes up
On-Policy TD (Sarsa)
- Convergence - guaranteed if policies are GLIE and step sizes satisfy the Robbins-Munro properties for a sequence
- Robbins-Munro -
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | \sum_i \alpha_i = \infty |
---|
|
, Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | \sum_i \alpha_i^2 < \infty |
---|
|
- Can consider n-step returns as we did for TD prediction, what is the best n to use? Sarsa(lambda)! Also can utilise eligibility traces
The lambda approach can help you update more quickly over one episode. See below.
Off-Policy Learning
Goal
- Evaluate target policy
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | \pi(a|s) |
---|
|
to compute Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | v_{\pi}(s) |
---|
|
or Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | q_{\pi}(s,a) |
---|
|
while following behaviour policy Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | \mu(a|s) |
---|
|
...
- Learn from watching humans
- Re-use experience generated from old policies - cache/batch old episodes
- Learn about optimal policy while following exploratory policy - make sure we explore the state space sufficiently
- Learn about multiple policies while following one policy
- Importance sampling has high variance (noise) for Monte-Carlo, but can be made to work for TD.
Off-Policy Q-Learning
Take the next step from the observed policy, but then consider bootstrapping the rest from the optimal policy
- Next action:
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | A_{t+1} \sim \mu(\cdot|S_t) |
---|
|
- Alternative successor action:
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | A' \sim \pi(\cdot|S_t) |
---|
|
- Update towards value of the alternative action:
Mathblock |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
|
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)) |
- Is the reward the reward gained from
Mathinline |
---|
host | 5cf3c9eb-f97f-3ae9-acb7-6704dfd8f9e4 |
---|
body | A_{t+1} |
---|
|
?