Excerpt: Markov decision processes formally describe an environment for Reinforcement Learning when the environment is fully observable.
Resources
Prediction
These methods estimate the value function of an unknown MDP.
Monte Carlo
- Learns directly from complete episodes of experience. Caveat: all episodes must terminate
- Uses the concept 'value = empirical mean return'
- Goal is to learn $v_{\pi}$ from episodes of experience under policy $\pi$
- Challenges:
  - Make sure we visit all states over our episodes enough times for the mean to approach its limit
Method
- Gather a set of episodes and run over each of them
- Each time a state is visited:
  - Increment a counter: $N(s) \leftarrow N(s) + 1$
  - Increment the total return for the state: $S(s) \leftarrow S(s) + G_t$
- Value is estimated by the mean return: $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$

The increments can be done on first visits only, or on every visit, and the limiting results will still hold. Why one or the other?

Typically the value function is estimated and updated on the fly with incremental mean calculations (often faster to implement):

$$
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left( G_t - V(S_t) \right)
$$

or over a moving window (slowly forget old episode contributions):

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)
$$
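A minimal sketch of the method above as every-visit Monte-Carlo prediction with the incremental-mean update; the `(state, reward)` episode layout and the `gamma` parameter are assumptions for illustration, not part of the notes.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Every-visit Monte-Carlo prediction using incremental means.

    `episodes` is assumed to be a list of episodes, each a time-ordered list
    of (state, reward) pairs; this data layout is illustrative.
    """
    N = defaultdict(int)    # visit counts N(s)
    V = defaultdict(float)  # value estimates V(s)

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so the return G_t accumulates in one pass
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            # Incremental mean: V(s) <- V(s) + (1/N(s)) (G_t - V(s))
            V[state] += (G - V[state]) / N[state]
    return V
```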
Temporal Difference Learning
Fundamentally different from Monte Carlo in that incomplete episodes are permissible. For the remainder of the return, a guess is provided to bootstrap the calculations.
$$
\begin{align}
V(S_t) &\leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right) \\
&\leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)
\end{align}
$$
- TD Target - $R_{t+1} + \gamma V(S_{t+1})$ is the estimated return; this is considered a biased estimate, as opposed to the Monte Carlo method
  - There are specific cases where the bias causes misbehaviour
- True TD Target - $R_{t+1} + \gamma v_{\pi}(S_{t+1})$ is the estimated return; this is considered an unbiased estimate
- Exploits the Markov property of the system (i.e. uses some knowledge about the system to provide a biased estimate)
  - Lends to an increased efficiency
- TD(lambda) - allows you to choose how far ahead to use returns before using a biased estimate (moving between shallow and deep backups). See the sketch below for the basic TD(0) case.
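A minimal sketch of tabular TD(0) prediction implementing the update above; the transition format `(state, reward, next_state, done)` and the `alpha`/`gamma` defaults are illustrative assumptions.

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha (R_{t+1} + gamma V(S_{t+1}) - V(S_t)).

    `transitions` is assumed to be an iterable of
    (state, reward, next_state, done) tuples; this format is illustrative.
    """
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        # Bootstrap with the current estimate of the next state's value;
        # a terminal transition contributes only the observed reward.
        td_target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (td_target - V[state])
    return V
```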
Comparisons
Property | Dynamic Programming | Monte Carlo | TD |
---|---|---|---|
Episodes | No | Yes, terminating | Yes, may be incomplete |
Bootstrapping | Yes | No | Yes |
Sampling | No | Yes | Yes |
Initial Value | Not sensitive | Not sensitive | Sensitive |
Bias | No bias | No bias | Biased estimate |
Noise/Variance | None | Lots (many R's in the measured G_t) | Low (one R before the biased estimate) |
Batching | - | Converges to a best fit for the observations | Converges to a max likelihood given all the data |
Markov property | - | More effective in non-Markov environments | More efficient in Markov environments |
Control
How can we find the optimal value function and subsequently policy? The algorithms are mostly concerned with On/Off Policy Monte Carlo and Temporal-Difference learning.
- On Policy - 'learning while on the job'
  - Learn about policy $\pi$ from experience sampled from $\pi$
- Off Policy - 'look over someone else's shoulder', e.g. robot looking at a human
  - Learn about policy $\pi$ from experience sampled from $\mu$
On-Policy Monte-Carlo Iteration
Basic premise is to apply the same process that dynamic programming uses to greedily improve the policy.
But this has two problems:
- Value Function: extracting the policy from the value function requires knowledge of the transition dynamics $\mathcal{P}^a_{ss'}$
  - Solution: use Q instead of V, since it doesn't need to know the transitions

$$
\pi'(s) = \mathrm{argmax}_{a \in \mathcal{A}} \, \mathcal{R}^a_{s} + \mathcal{P}^a_{ss'}V(s') \hspace{1cm} \rightarrow \hspace{1cm} \pi'(s) = \mathrm{argmax}_{a \in \mathcal{A}} Q(s,a)
$$
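A tiny sketch of the point being made: greedy improvement over Q is a pure table lookup, with no reward model $\mathcal{R}$ or transition model $\mathcal{P}$ required. The dict-of-arrays layout for `Q` is an assumption.

```python
import numpy as np

def greedy_policy_from_q(Q):
    """pi'(s) = argmax_a Q(s, a), using action values alone.

    `Q` is assumed to map each state to a NumPy array of action values.
    """
    return {state: int(np.argmax(action_values)) for state, action_values in Q.items()}
```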
- Policies: no guarantee that the greedy policy improvements generate policies that can sufficiently explore the space
  - Solution: introduce some randomness to the greedy update (note m is the number of actions) ... epsilon-greedy!
    - This satisfies the policy improvement theorem, i.e. everything gets better
    - Hard to do better than this naive method
$$
\pi(a|s) = \begin{cases}
\epsilon/m + 1 - \epsilon & \mathrm{if} \; a = \mathrm{argmax}_{a' \in \mathcal{A}} Q(s,a') \\
\epsilon/m & \mathrm{otherwise}
\end{cases}
$$
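A minimal sketch of epsilon-greedy action selection matching the probabilities above; representing `q_values` as a NumPy array is an assumption.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Sample an action: probability epsilon/m + 1 - epsilon for the greedy action,
    epsilon/m for each of the other m - 1 actions."""
    rng = rng if rng is not None else np.random.default_rng()
    m = len(q_values)
    probs = np.full(m, epsilon / m)                  # epsilon/m mass on every action
    probs[int(np.argmax(q_values))] += 1 - epsilon   # extra mass on the greedy action
    return int(rng.choice(m, p=probs))
```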
- Improvements
  - Update the policy after every episode, speedup!
- Greedy in the Limit with Infinite Exploration (GLIE)
  - All state-action pairs are explored infinitely many times
  - The policy converges on a greedy policy
  - Guarantee GLIE (see the sketch below)
    - Evaluation: sample episodes with the current policy, counting visits $N(S_t, A_t)$ and updating $Q(S_t, A_t)$ towards the mean return
    - Improvement: make sure epsilon goes to zero as the update count goes up, e.g. $\epsilon_k = 1/k$ on episode $k$
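Putting the pieces together, a sketch of GLIE Monte-Carlo control with an $\epsilon_k = 1/k$ schedule. The `env` object with `reset() -> state` and `step(action) -> (next_state, reward, done)` is a hypothetical interface assumed for illustration.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, num_actions, gamma=1.0):
    """GLIE Monte-Carlo control sketch (epsilon-greedy policy, epsilon_k = 1/k)."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))  # action values Q(s, a)
    N = defaultdict(lambda: np.zeros(num_actions))  # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k  # GLIE: exploration decays towards zero
        # Generate one episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(num_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Evaluation: incremental-mean update of Q towards the observed returns
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```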
On-Policy TD (Sarsa)
- Convergence - guaranteed if the policies are GLIE and the step sizes satisfy the Robbins-Monro conditions: $\sum_i \alpha_i = \infty$ and $\sum_i \alpha_i^2 < \infty$
- Can consider n-step returns as we did for TD prediction; what is the best n to use? Sarsa(lambda)! Eligibility traces can also be utilised
  - The lambda approach can help you update more quickly over one episode.
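A sketch of the basic on-policy Sarsa(0) control loop (not the Sarsa(lambda)/eligibility-trace variant mentioned above); the hypothetical `env` interface and the fixed `epsilon`/`alpha` values are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy Sarsa(0):
    Q(S,A) <- Q(S,A) + alpha (R + gamma Q(S',A') - Q(S,A))."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))

    def behave(state):
        # epsilon-greedy; on-policy, so this is also the policy being evaluated
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = behave(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = behave(next_state)
            # Bootstrap with the action actually taken next (S, A, R, S', A')
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```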
Off-Policy Learning
Goal
- Evaluate target policy $\pi(a|s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$ while following behaviour policy $\mu(a|s)$
- Target Policy $\pi(a|s)$ - the policy we are trying to optimise
- Behaviour Policy $\mu(a|s)$ - the policy we actually follow to generate the experience we learn from
Why?
- Learn from watching humans
- Re-use experience generated from old policies - cache/batch old episodes
- Learn about optimal policy while following exploratory policy - make sure we explore the state space sufficiently
- Learn about multiple policies while following one policy
- Importance sampling has high variance (noise) for Monte-Carlo, but can be made to work for TD (see the sketch below).
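A sketch of off-policy TD(0) prediction with a single-step importance-sampling correction on the TD target, which is the usual way importance sampling "can be made to work for TD". The transition format and the `target_probs`/`behaviour_probs` callables are hypothetical names assumed for illustration.

```python
from collections import defaultdict

def off_policy_td0(transitions, target_probs, behaviour_probs, alpha=0.1, gamma=1.0):
    """Off-policy TD(0) with one importance-sampling ratio per step:
    V(S) <- V(S) + alpha (rho (R + gamma V(S')) - V(S)),  rho = pi(A|S) / mu(A|S).

    `transitions` is assumed to be (state, action, reward, next_state, done)
    tuples generated by the behaviour policy mu; `target_probs(s, a)` and
    `behaviour_probs(s, a)` are assumed to return pi(a|s) and mu(a|s).
    """
    V = defaultdict(float)
    for state, action, reward, next_state, done in transitions:
        rho = target_probs(state, action) / behaviour_probs(state, action)
        td_target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (rho * td_target - V[state])
    return V
```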
Off-Policy Q-Learning
Take the next step using the observed (behaviour) policy, but then bootstrap the remainder of the return from the target (greedy/optimal) policy.
- Is the reward the reward gained from $A_{t+1}$?
- Use an epsilon-greedy policy for the behavioural policy (exploration) and a greedy policy for the target policy (optimal)
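A sketch of the corresponding tabular Q-learning loop, where actions are chosen epsilon-greedily but the bootstrap uses the greedy (max) action value; the same hypothetical `env` interface as above is assumed.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy Q-learning:
    Q(S,A) <- Q(S,A) + alpha (R + gamma max_a Q(S',a) - Q(S,A))."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy in Q (keeps exploring)
            if rng.random() < epsilon:
                action = int(rng.integers(num_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Target policy: greedy in Q, so bootstrap with max_a Q(S', a)
            target = reward + (0.0 if done else gamma * float(np.max(Q[next_state])))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```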