Markov decision processes formally describe a fully observable environment for Reinforcement Learning.


Prediction

These methods estimate the value function of a given policy in an unknown MDP.

Monte Carlo

Method

The increments can be made on the first visit to a state in each episode, or on every visit, and the limiting results still hold. Why one or the other? First-visit updates use independent, unbiased samples of the return from each state; every-visit updates extract more samples per episode, but the samples are correlated and the estimate is only asymptotically unbiased.

Typically the value function is estimated and updated on the fly with an incremental mean: increment the visit count N(S_t) \leftarrow N(S_t) + 1, then update

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t))

or, with a constant step size \alpha, as a running mean that gradually forgets old episode contributions (useful in non-stationary problems):

V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t))
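
A minimal sketch of incremental every-visit Monte Carlo prediction in Python, covering both update rules above. The episode format (a list of (state, reward) pairs, with the reward received on leaving the state) is an assumption made for illustration:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, alpha=None):
    """Every-visit Monte Carlo prediction.

    episodes: iterable of episodes, each a list of (state, reward) pairs,
              where reward is the reward received on leaving that state.
    alpha:    None -> incremental mean 1/N(s); a float -> constant step size.
    """
    V = defaultdict(float)   # value estimates V(s)
    N = defaultdict(int)     # visit counts N(s)

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so the return G_t accumulates incrementally.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            step = alpha if alpha is not None else 1.0 / N[state]
            V[state] += step * (G - V[state])
    return V
```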

Temporal Difference Learning

Fundamentally different from Monte Carlo in that incomplete episodes are permissible: the remainder of the return is replaced by a guess, bootstrapping the calculation. The Monte Carlo target G_t becomes the TD target R_{t+1} + \gamma V(S_{t+1}):

\begin{align}
V(S_t) &\leftarrow V(S_t) + \alpha ( G_t - V(S_t)) \\
V(S_t) &\leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t))
\end{align}
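
A corresponding TD(0) prediction sketch. The env interface (reset() returning a state, step(action) returning (next_state, reward, done)) and the policy callable are assumptions made for the example, not part of these notes:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, gamma=1.0, alpha=0.1):
    """TD(0): update V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrap from the current estimate of the next state's value.
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

Unlike the Monte Carlo version, the update happens after every step, so learning proceeds before the episode terminates.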

Comparisons

| Property | Dynamic Programming | Monte Carlo | TD |
| --- | --- | --- | --- |
| Episodes | No | Yes, terminating | Yes, may be incomplete |
| Bootstrapping | Yes | No | Yes |
| Sampling | No | Yes | Yes |
| Initial value | Not sensitive | Not sensitive | Sensitive |
| Bias | Unbiased | Unbiased | Biased (bootstrapped estimate) |
| Noise/Variance | None | High (many rewards R contribute to the measured G_t) | Low (one reward R before the bootstrapped estimate) |
| Batching | - | Converges to the solution with minimum mean-squared error on the observed returns | Converges to the solution of the maximum-likelihood Markov model of the data |
| Environment | - | More effective in non-Markov environments | More efficient in Markov environments |


Control

How can we find the optimal value function and, from it, the optimal policy? The algorithms here are on-policy and off-policy variants of Monte Carlo and temporal-difference learning.

On-Policy Monte-Carlo Iteration

Basic premise is to apply the same process that dynamic programming uses to greedily improve the policy.

But this has two problems. First, greedy improvement over V(s) requires a model of the MDP, so we act greedily with respect to the action-value function Q(s,a) instead, which is model-free:

\pi'(s) = argmax_{a \in \mathcal{A}} \left( \mathcal{R}^a_{s} + \gamma \sum_{s'} \mathcal{P}^a_{ss'}V(s') \right) \hspace{1cm} \rightarrow \hspace{1cm} \pi'(s) = argmax_{a \in \mathcal{A}} Q(s,a)

Second, acting purely greedily can leave some actions unexplored, so exploration is maintained with an \epsilon-greedy policy over the m = |\mathcal{A}| actions:

\pi(a|s) = \begin{cases}
\epsilon/m + 1 - \epsilon & \mathrm{if} \; a = argmax_{a' \in \mathcal{A}} Q(s,a') \\
\epsilon/m & \mathrm{otherwise} \\
\end{cases}
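
A small sketch of the \epsilon-greedy choice above, assuming a tabular Q stored as a dict keyed by (state, action) and a finite list of actions (so m = len(actions)). With probability \epsilon an action is drawn uniformly, giving each action \epsilon/m; otherwise the greedy action is taken, adding the remaining 1 - \epsilon:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Sample an action from the epsilon-greedy policy pi(a|s) defined above."""
    if random.random() < epsilon:
        return random.choice(actions)   # explore: uniform, eps/m per action
    # exploit: the greedy action receives the extra 1 - eps probability mass
    return max(actions, key=lambda a: Q[(state, a)])
```

On-policy Monte Carlo control then alternates evaluating Q under this policy with \epsilon-greedy improvement, decaying \epsilon over time so the policy becomes greedy in the limit.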

On-Policy TD (Sarsa)


Sarsa follows the current \epsilon-greedy policy and updates Q toward the TD target built from the next state-action pair actually taken:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha ( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))

The lambda approach (Sarsa(\lambda)) uses eligibility traces to propagate updates more quickly within a single episode. See below.
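
A minimal sketch of tabular Sarsa(0), reusing the assumed env interface from the TD(0) example (reset() and step(action) returning (next_state, reward, done)) and a finite list of actions:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, gamma=1.0, alpha=0.1, epsilon=0.1):
    """On-policy TD control: update Q toward R + gamma * Q(S', A'),
    where A' is the action the epsilon-greedy policy actually takes next."""
    Q = defaultdict(float)

    def pick(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = pick(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward
            else:
                next_action = pick(next_state)
                target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q
```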

Off-Policy Learning

Goal: evaluate a target policy \pi(a|s) (to compute v_\pi or q_\pi) while following a different behaviour policy \mu(a|s).

Why? It allows learning from observing humans or other agents, re-using experience generated from old policies, learning about the optimal policy while following an exploratory policy, and learning about multiple policies while following one policy.


Off-Policy Q-Learning

Take the next action from the behaviour policy, but bootstrap the remainder of the return from the greedy (target) policy, i.e. the successor action A' = argmax_{a'} Q(S_{t+1}, a'):

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha ( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t))
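
A Q-learning sketch under the same assumptions. The only change from the Sarsa sketch is the bootstrap: the behaviour policy stays \epsilon-greedy, but the target uses the max over next-state actions, i.e. the greedy target policy:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, gamma=1.0, alpha=0.1, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, bootstrap greedily."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy, hence the max over next-state actions.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```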