
Concise list of terms and concepts in reinforcement learning.

...

General Taxonomy

  • Agent Types - policy-based, value-based, actor-critic (policy and value), model-free and model-based


  • History $H_t$ - the sequence of observations, actions and rewards ($O_1, R_1, A_1, \dots, O_t, R_t, A_t$)
  • Markov - a state is Markov if the future can be determined from the current state alone (i.e. all history up to the present can be thrown away)
  • Model - predicts what the environment will do next given the current state and an action, usually in two parts
    • dynamics: predicts the next agent state given its interaction with the environment
    • rewards: predicts the next reward received from the environment

  • Policy $\pi(s)$ / $\pi(a|s)$ - defines how an agent will behave; it is a map from state to action, and can be deterministic or probabilistic

  • State $S_t, S^a_t, S^e_t$ - state, agent state, environment state
  • Value Function - provides an estimate of how good each state and/or action is
    • e.g. a predictor of future reward given a policy and a state ($v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s]$)
    • This gets complicated to calculate when policies are probabilistic

Concepts

  • Exploration vs Exploitation - explore and learn the environment, then exploit the environment to maximise reward (needs careful balance)
  • Prediction vs Control
    • Prediction - What is the value function for the given policy?
    • Control - what is the optimal value function or policy?
  • Reinforcement Learning vs Planning
    • Reinforcement Learning (interaction) - environment is unknown (no model), interacts with the environment, improves its policy
    • Planning (thinking) - environment is known (model), no interaction with the environment, computes with the model, improves its policy
  • Stationary Processes - Time independence, e.g. MDP states that represent the same logic regardless of the time they are entered.
  • Stochastic Policies - are sometimes necessary, e.g. rock paper scissors requires a random policy lest it gets exploited by the other player

Markov Decision Processes

  • Bellman Equation (MRP) - relates the value of each state to the values of its successor states; can be solved directly by matrix inversion or iteratively (a minimal direct-solve sketch follows the equations below)


\begin{align}
v &= \mathcal{R} + \gamma \mathcal{P}v \\
  &= (I - \gamma \mathcal{P})^{-1} \mathcal{R}
\end{align}
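
To make the direct solution concrete, here is a minimal sketch that solves v = (I - gamma P)^{-1} R for a small, made-up 3-state MRP; the transition matrix, rewards and discount below are illustrative assumptions, not values from the lecture:

import numpy as np

# Hypothetical 3-state MRP (numbers are made up): transition matrix P, reward vector R.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # each row sums to 1
R = np.array([1.0, 2.0, 0.0])     # expected immediate reward per state
gamma = 0.9

# Direct solution of the Bellman equation: v = R + gamma * P v  =>  (I - gamma P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
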
  • Bellman Expectation Equations (MDP) - the Bellman equations extended with actions and hence policies (now considering both state-value and action-value functions):


\begin{align}
v_{\pi}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\
           &= \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}v_{\pi}(s') \right) \\
           &= \sum_{a \in \mathcal{A}} \pi(a|s) q_{\pi}(s,a) \\
v_{\pi} &= \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi}v_{\pi} \\
  &= (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi} \\
q_{\pi}(s,a) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1}) | S_t = s, A_t = a] \\
  &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a'|s') q_{\pi}(s',a') \\
  &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}v_{\pi}(s') \\
\end{align}
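
As a concrete illustration of the "flatten" idea noted further below (an MDP plus a fixed policy collapses into an MRP), here is a minimal sketch for a made-up 2-state, 2-action MDP; the array layouts (P[a, s, s'], R[a, s], pi[s, a]) and all numbers are assumptions for illustration only:

import numpy as np

# Hypothetical 2-state, 2-action MDP; P[a, s, s'] and R[a, s] are made-up numbers.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # transitions under action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # transitions under action 1
R = np.array([[1.0, 0.0],                  # rewards under action 0
              [0.5, 2.0]])                 # rewards under action 1
pi = np.array([[0.5, 0.5],                 # pi[s, a]: probability of action a in state s
               [0.2, 0.8]])
gamma = 0.9

# Flatten the MDP under pi into an MRP: R_pi[s] = sum_a pi(a|s) R^a_s, P_pi[s, s'] = sum_a pi(a|s) P^a_ss'
R_pi = np.einsum('sa,as->s', pi, R)
P_pi = np.einsum('sa,ast->st', pi, P)

# Bellman expectation equation solved directly: v_pi = (I - gamma P_pi)^-1 R_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Action values from state values: q_pi[a, s] = R^a_s + gamma * sum_s' P^a_ss' v_pi(s')
q_pi = R + gamma * np.einsum('ast,t->as', P, v_pi)
print(v_pi, q_pi)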

...

  • Discount $\gamma$ - how much reward diminishes into the future when calculating the expected return
  • Episode - a specific sequence of states from a Markov Process (Chain)
  • Flatten - you can always flatten an MDP with a specific policy into an MRP.
  • Markov Process - a tuple $<\mathcal{S}, \mathcal{P}>$ defining a memoryless random system (also known as a Markov Chain)
    • Markov Reward Process $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$ - adds a reward function and discount factor, from which values can be computed
    • Markov Decision Process $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ - includes actions, which condition the probability transition matrix and reward function
    • Partially Observable Markov Decision Process $<\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma>$ - has hidden states and an observation function $\mathcal{Z}_{s' o}^a = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]$
  • Optimality
    • Bellman Equations - similar to the policy-based Bellman equations but with $v_*$ and $q_*$
      • Solving for $v_*$ and $q_*$ allows you to compute the optimal policy
      • But they are nonlinear and, in general, have no closed-form solution - use iterative methods
    • State/Action Value Function - the maximal state/action value function over all policies, e.g. $v_*(s) = \max_{\pi} v_{\pi}(s)$
    • Policies - an optimal policy is better than every other policy over all states, given the partial ordering $\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$
      • There always exists a deterministic optimal policy for MDPs (see slides)
      • It is not a composition of bits and pieces of other policies; it is a policy in its own right
      • It may not be unique
  • Policy - defines the behaviour of an agent; mathematically it is a distribution over actions given states, $\pi(a|s) = \mathbb{P}(A_t = a | S_t = s)$
  • Probability State Transition Matrix $\mathcal{P}_{ss'}$ - the probabilities of going from state $s$ to state $s'$
  • Return $G_t$ - the total discounted reward into the future over a specific episode (incorporating discounting)
  • Reward $\mathcal{R}$ - the expected immediate reward for the current state, $\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]$
  • Value Function $v(s)$ - how good it is to be in any state, i.e. the expected return from a given state
    • State Value Function $v_{\pi}(s)$ - the expected return from a state when following a specific policy, $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$
      • i.e. how good is it to be in this state?
    • Action Value Function $q_{\pi}(s, a)$ - the expected return from a state, taking a specific action, and then following a specific policy, $q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]$
      • i.e. how good is it to take a particular action from a state?

Dynamic Programming

  • Synchronous backups - used when evaluating a policy, i.e. finding the value function via iteration on the Bellman equation
  • Greedy Policy Improvement - choose actions with preference towards moving to a state with a better value function (i.e. uses the action value function to prioritise the new policy action)
  • Policy Evaluation and Iteration - alternates evaluation (synchronous backups) and improvement (greedy policy improvement) to converge to the optimal value function and policy.
    • With care, inner loop (evaluation) and outer loop (improvement) can be interspersed


  • Value Iteration - works backwards from a known optimal value at a specific state, $v_*(s')$, to compute the entire optimal value function. No explicit policy needed. Uses the Bellman optimality equations (the ones without a policy). A minimal sketch follows below.
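
A minimal value-iteration sketch over the Bellman optimality backup; the tabular array layout (P[a, s, s'], R[a, s]) is an assumption carried over from the earlier sketch:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s'] transition probabilities, R[a, s] expected rewards (tabular arrays as assumed above)."""
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup: v(s) = max_a [ R^a_s + gamma * sum_s' P^a_ss' v(s') ]
        q = R + gamma * np.einsum('ast,t->as', P, v)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # The greedy policy falls out of the optimal action values.
    return v_new, q.argmax(axis=0)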

Model Free

Prediction

Estimating the value function given a policy from episodic data.


Monte Carlo Methods

  • First/Every Visit Mean Estimation
    • Gather episodes for a policy, then accumulate returns for the first (or every) visit to a state within each episode.
    • Determine the value function by a total, incremental or moving-window mean calculation (see the sketch after the equations below):
\begin{align}
V(S_t) &\leftarrow \sum G_t / N(S_t) \\
&\leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t)) \\
&\leftarrow V(S_t) + \alpha ( G_t - V(S_t)) \\
\end{align}
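
A minimal every-visit Monte Carlo prediction sketch with the incremental-mean update; the episode format (a list of (S_t, R_{t+1}) pairs) is an assumption for illustration:

from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Every-visit Monte Carlo evaluation. Each episode is a list of (S_t, R_{t+1}) pairs
    generated by the policy being evaluated (an assumed data format)."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step onwards.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # incremental mean update
    return V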

TD

  • Value Function Update - same as Monte Carlo, except that it uses the Bellman equation to estimate the return beyond the next step (a one-step sketch follows below)
\begin{align}
V(S_t) &\leftarrow V(S_t) + \alpha ( G_t - V(S_t)) \\
&\leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \\
\end{align}
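
A minimal tabular TD(0) prediction sketch; the env interface (reset() and step() returning (next_state, reward, done)) is an assumed stand-in, not a specific library:

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): move V(S_t) towards R_{t+1} + gamma * V(S_{t+1}) after every step.
    `env` is assumed to expose reset() -> state and step(action) -> (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V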

...

Tip: TD(lambda) interpolates between TD(0) and Monte Carlo. Monte Carlo is typically noisy, TD(0) is typically not enough; TD(lambda) often helps you find the sweet spot.

  • Online vs Offline - whether you update the value function in the middle of an episode or at the end of an episode
  • Eligibility Trace - keeps track of the frequency and recency of state visits; used to weight the value function update for each state (a minimal backward-view sketch follows below)
  • Forward vs Backward View - didn't really catch this...
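
A minimal backward-view TD(lambda) sketch with accumulating eligibility traces, under the same assumed env interface as the TD(0) sketch above:

from collections import defaultdict

def td_lambda_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda): every state's value moves towards the one-step TD error,
    weighted by its eligibility trace (recency and frequency of visits)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        E = defaultdict(float)           # eligibility traces, reset at the start of each episode
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0              # accumulating trace for the state just visited
            for s in list(E):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam      # decay every trace each step
            state = next_state
    return V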

Control

Optimise the value function (and subsequently find the optimal policy) for an unknown MDP.

...

  • Method - similar to that for dynamic programming, except
    • Uses Q, not V: this solves the problem of not knowing the model, or more specifically the transition dynamics
    • Uses epsilon-greedy updates: to make sure that the policy keeps exploring enough to generate sufficient samples for the next step
  • Epsilon greedy - adds some randomness to the greedy update for more exploratory policy improvement (a minimal selection sketch follows the GLIE notes below)
  • Speedup - update the policy after every episode


  • Greedy in the Limit with Infinite Exploration (GLIE)
    • All state-action pairs are explored infinitely many times
    • The policy converges on a greedy policy
  • Guaranteeing GLIE
    • Evaluation: sample episodes with the current policy and update Q(S,A) towards the return with a count-based 1/N(S,A) step size
    • Improvement: make sure epsilon goes to zero as the update count goes up (e.g. epsilon_k = 1/k, as in the sketch below)
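
A minimal GLIE-style epsilon-greedy selection sketch; the 1/k decay schedule and the dictionary layout of Q are illustrative assumptions:

import random

def epsilon_greedy_action(Q, state, actions, episode_k):
    """Pick a random action with probability epsilon, otherwise the greedy action.
    With epsilon = 1/episode_k the exploration decays to zero, so the policy is greedy in the limit."""
    epsilon = 1.0 / max(1, episode_k)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))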

On-Policy TD (Sarsa)

  • Convergence - guaranteed if policies are GLIE and step sizes satisfy the Robbins-Monro conditions for a sequence
    • Robbins-Monro - $\sum_i \alpha_i = \infty$ and $\sum_i \alpha_i^2 < \infty$
  • Sarsa(lambda) - decaying geometric weight used to do n-step Sarsa (TD style); can use eligibility traces in the same way to weight for or against recently/frequently visited states (a one-step Sarsa update sketch follows below)
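
A minimal one-step Sarsa sketch (on-policy TD control); the env interface and the 1/k epsilon decay are the same assumptions as in the earlier sketches:

import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9):
    """On-policy TD control: update Q(S,A) towards R + gamma * Q(S',A'),
    where A' is chosen by the same epsilon-greedy policy being improved."""
    Q = defaultdict(float)

    def eps_greedy(state, k):
        # GLIE-style epsilon-greedy with epsilon = 1/k
        if random.random() < 1.0 / k:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for k in range(1, n_episodes + 1):
        state = env.reset()
        action = eps_greedy(state, k)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state, k)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q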

Off-Policy Learning

  • Reasons - watch a human, re-use episodes from old policies, learn about the optimal policy while exploring
  • Importance Sampling - not practical for Monte Carlo (too much variance over the many steps in an episode); use it for TD and apply the ratio of the policy probabilities for the single step
  • Target Policy $\pi$ - the policy we are trying to optimise
  • Behaviour Policy $\mu$ - the policy that actually selects actions and generates the experience

Q Learning

  • Behaviour/Target Policies - the behaviour policy is epsilon-greedy (exploration), the target policy is greedy (optimal); see the update sketch below
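
A minimal Q-learning sketch (off-policy TD control): behave epsilon-greedily, but bootstrap the target towards the greedy max; the env interface and 1/k decay schedule are the same assumptions as above:

import random
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap towards the greedy
    (max) action value, i.e. the target policy is greedy with respect to Q."""
    Q = defaultdict(float)
    for k in range(1, n_episodes + 1):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with epsilon = 1/k
            if random.random() < 1.0 / k:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy max over actions at the next state
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q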


Scaling

For both prediction and control there are too many states to store $V(s)$ or $Q(s,a)$ directly as an oracular lookup in a table, e.g. Go is on the order of $10^{170}$ states. Instead, we try to estimate the value function used in the update equations.

\begin{align}
\hat{v}(s, \boldsymbol{w}) &\simeq v_{\pi}(s) \\
\hat{q}(s, a, \boldsymbol{w}) &\simeq q_{\pi}(s,a) \\
\end{align}
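
A minimal sketch of a semi-gradient TD(0) step with a linear value-function approximation; the feature map phi is a placeholder assumption for a problem-specific function:

import numpy as np

def semi_gradient_td0_step(w, phi, state, reward, next_state, done, alpha=0.01, gamma=0.9):
    """One update of the weights of a linear approximation v_hat(s, w) = phi(s) . w.
    `phi` is an assumed, problem-specific function mapping a state to a fixed-length feature vector."""
    v = phi(state) @ w
    v_next = 0.0 if done else phi(next_state) @ w
    td_error = reward + gamma * v_next - v
    # Semi-gradient: the gradient of v_hat with respect to w is just phi(state).
    return w + alpha * td_error * phi(state)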

...

  • Batching Methods
    • Experience Replay - store the transitions generated by previous (limited-perspective) runs, sample from that dataset, and optimise the value-function parameters (e.g. the weights of a linear combination) towards the targets (a minimal replay-buffer sketch follows below).
      • i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
    • Deep Q-Networks - uses experience replay and fixed Q-targets; the algorithm skeleton is in Silver's slides.
      • This is a stable algorithm!
        • Experience replay decorrelates the trajectories
        • Two networks (didn't really catch the why of this: something about bootstrapping towards a frozen target from some time ago rather than the latest, freshest parameters, which otherwise causes spiralling problems)
  • Linear Least Squares Methods
    • Go straight to the solution by setting the expected weight change to zero and solving for the weights directly.
    • Only works for linear approximations
    • Can improve the situation for policy learning
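
A minimal experience-replay buffer sketch; the capacity, batch size and transition tuple layout are illustrative assumptions:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions.
    Sampling uniformly at random decorrelates consecutive steps of a trajectory."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))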


Policy Gradient Approximations

Definitions

  • State Aliasing - when an environment is partially observed, or your features are reduced so that it is impossible to make a deterministic decision to get to a goal
  • Stochastic Policy / State Aliasing - a stochastic policy can get you out of trouble by choosing randomly when there is no better idea of what to do, or when there are conflicting motivations

Background

  • Why Policy Based Reinforcement Learning? - for some problems, it is better or more convenient to learn the policy by wandering around and improving a policy based on rewards without ever actually learning the value functions.
    • Sometimes the policy is much less complex than the value function
      • e.g. atari games - if the ball is here then it's fairly obvious you should move to the left, but much harder to realise that the value of moving left in that situation is worth 1714 points.
    • Better convergence properties - incremental changes in policy result in incremental changes in actions, while incremental changes in value can result in large changes to policy, subseq. actions, subseq. next value.
    • Computing greedy / epsilon-greedy policy from a value function with a large action space (e.g. robot torques) can be super expensive
    • Can learn stochastic policies - as opposed to greedy / epsilon-greedy policies derived from value functions which are deterministic in nature
      • Are there alternatives to greedy / epsilon greedy that are still optimal but not deterministic?
    • Disadvantage: learning can be slower (tradeoff for being smoother) (value function updates make large pushes to an optimal policy)
    • Disadvantage: for some problems complexity in evaluating a policy is worse than updating a value function or deriving a policy from a value function
    • Example: rock paper scissors doesn't have an optimal deterministic strategy - value based learning would fail
    • Example: partially observed environments or incomplete features can result in a suboptimal deterministic strategy - stochastic policies would be better
  • Parameterised Policy - approximate the optimal policy by parameterising it: $\pi_{\theta}(s,a) = \mathbb{P}[a|s, \theta]$
    • Set a differentiable objective function based on this parameter → a gradient to update from!
  • Indicator (one) Functions - a useful way of setting up linear on/off features
    • e.g. $\phi(s, a) = \boldsymbol{1}(\text{wall to the north, move east})$
  • Objective Functions - how to evaluate a policy
    • $J_1$ - use the value at the start state (the expected return over an episode) (episodic environments only!)
    • $\frac{1}{1-\gamma}J_{avV}$ - the average value over the stationary distribution of states
    • $J_{avR}$ - the average reward per time-step
    • Maximise the objective function with respect to theta; lots of methods available
    • What is the stationary distribution of the Markov chain?
    • How do you compute $v_{\pi_{\theta}}(s)$ in a continuing environment?
    • Why does he call it u?
  • Likelihood Ratios - Exploit the following identity. 

...

  • Score Function - is $\nabla_{\theta} \log \pi_{\theta}(s,a)$. Would get used in maximum likelihood learning.
  • Softmax Policy - basically proportional to some exponentiated value; see also Wikipedia on the softmax function for probabilities.
  • Linear Softmax Policy - the exponent is a linear combination of features weighted by $\theta$, e.g. $\pi_{\theta}(s,a) \propto e^{\phi(s,a)^T \theta}$
    • The score function is $\nabla_{\theta} \log \pi_{\theta}(s,a) = \phi(s,a) - \mathbb{E}_{\pi_{\theta}}[\phi(s, \cdot)]$
      • Essentially: score is 'feature - average feature'
    • Where does the latter part of this term come from?
  • Gaussian Policy - used in continuous action spaces where the mean is a linear combination of state features, i.e. $\mu(s) = \phi(s)^T\theta$, and the policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^2)$
    • The score function is $\nabla_{\theta} \log \pi_{\theta}(s,a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}$
      • Essentially: action - mean action (how much more than usual am I doing something)
    • What is $a$ exactly?
  • Policy Gradient Theorem - for any differentiable policy $\pi_{\theta}(s,a)$ and for any of the policy objective functions $J_1$, $\frac{1}{1-\gamma}J_{avV}$, $J_{avR}$, the policy gradient is (note Q is often just sampled from episodes; what if you don't have episodes?):

...