
Concise list of terms and concepts in reinforcement learning.

...

General Taxonomy

  • Agent Types - policy-based, value-based, actor-critic (policy and value), model-free and model-based


  • History $H_t$ - the sequence of observations, actions and rewards ($O_1, R_1, A_1, \dots, O_t, R_t, A_t$)
  • Markov - a state is Markov if the future can be determined from the current state alone (i.e. all history up to the present can be thrown away)
  • Model - predicts what the environment will do next given the current state and an action, usually in two parts
    • dynamics: predicts the next agent state given its interaction with the environment
    • rewards: predicts the next reward received from the environment

  • Policy $\pi(s)$ / $\pi(a|s)$ - defines how an agent will behave; it is a map from state to action, and can be deterministic or probabilistic

  • State $S_t, S^a_t, S^e_t$ - state, agent state, environment state
  • Value Function - provides an estimate of how good each state and/or action is
    • e.g. a predictor of future reward given a policy and a state ($v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s]$)
    • This gets complicated to calculate when policies are probabilistic

Concepts

  • Exploration vs Exploitation - explore and learn the environment, then exploit the environment to maximise reward (needs careful balance)
  • Prediction vs Control
    • Prediction - What is the value function for the given policy?
    • Control - what is the optimal value function or policy?
  • Reinforcement Learning vs Planning
    • Reinforcement Learning (interaction) - environment is unknown (no model), interacts with the environment, improves its policy
    • Planning (thinking) - environment is known (model), no interaction with the environment, computes with the model, improves its policy
  • Stationary Processes - Time independence, e.g. MDP states that represent the same logic regardless of the time they are entered.
  • Stochastic Policies - are sometimes necessary, e.g. rock paper scissors requires a random policy lest it gets exploited by the other player

Markov Decision Processes

  • Bellman Equation (MRP) - relates the value of each state to the values of its successor states; can be solved directly by matrix inversion or iteratively (a minimal direct-solve sketch follows the equations below)


\begin{align}
v &= \mathcal{R} + \gamma \mathcal{P}v \\
  &= (I - \gamma \mathcal{P})^{-1} \mathcal{R}
\end{align}
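
To make the direct solution concrete, here is a minimal sketch that solves v = (I - gamma P)^{-1} R for a small, made-up 3-state MRP; the transition matrix, rewards and discount below are illustrative assumptions, not values from the lecture:

import numpy as np

# Hypothetical 3-state MRP (numbers are made up): transition matrix P, reward vector R.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # each row sums to 1
R = np.array([1.0, 2.0, 0.0])     # expected immediate reward per state
gamma = 0.9

# Direct solution of the Bellman equation: v = R + gamma * P v  =>  (I - gamma P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
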
  • Bellman Expectation Equations (MDP) - the Bellman equations extended with actions and hence policies (now considering both state-value and action-value functions):


\begin{align}
v_{\pi}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\
           &= \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}v_{\pi}(s') \right) \\
           &= \sum_{a \in \mathcal{A}} \pi(a|s) q_{\pi}(s,a) \\
v_{\pi} &= \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi}v_{\pi} \\
  &= (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi} \\
q_{\pi}(s,a) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1}) | S_t = s, A_t = a] \\
  &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a'|s') q_{\pi}(s',a') \\
  &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}v_{\pi}(s') \\
\end{align}
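
As a concrete illustration of the "flatten" idea noted further below (an MDP plus a fixed policy collapses into an MRP), here is a minimal sketch for a made-up 2-state, 2-action MDP; the array layouts (P[a, s, s'], R[a, s], pi[s, a]) and all numbers are assumptions for illustration only:

import numpy as np

# Hypothetical 2-state, 2-action MDP; P[a, s, s'] and R[a, s] are made-up numbers.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # transitions under action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # transitions under action 1
R = np.array([[1.0, 0.0],                  # rewards under action 0
              [0.5, 2.0]])                 # rewards under action 1
pi = np.array([[0.5, 0.5],                 # pi[s, a]: probability of action a in state s
               [0.2, 0.8]])
gamma = 0.9

# Flatten the MDP under pi into an MRP: R_pi[s] = sum_a pi(a|s) R^a_s, P_pi[s, s'] = sum_a pi(a|s) P^a_ss'
R_pi = np.einsum('sa,as->s', pi, R)
P_pi = np.einsum('sa,ast->st', pi, P)

# Bellman expectation equation solved directly: v_pi = (I - gamma P_pi)^-1 R_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Action values from state values: q_pi[a, s] = R^a_s + gamma * sum_s' P^a_ss' v_pi(s')
q_pi = R + gamma * np.einsum('ast,t->as', P, v_pi)
print(v_pi, q_pi)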

...

  • Discount $\gamma$ - how much reward diminishes into the future when calculating the expected return
  • Episode - a specific sequence of states from a Markov Process (Chain)
  • Flatten - you can always flatten an MDP with a specific policy into an MRP.
  • Markov Process - a tuple $<\mathcal{S}, \mathcal{P}>$ defining a memoryless random system (also known as a Markov Chain)
    • Markov Reward Process $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$ - adds a reward function and discount factor, from which values can be computed
    • Markov Decision Process $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ - includes actions, which condition the probability transition matrix and reward function
    • Partially Observable Markov Decision Process $<\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma>$ - has hidden states and an observation function $\mathcal{Z}_{s' o}^a = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]$
  • Optimality
    • Bellman Equations - similar to the policy-based Bellman equations but with $v_*$ and $q_*$
      • Solving for $v_*$ and $q_*$ allows you to compute the optimal policy
      • But they are nonlinear and, in general, have no closed-form solution - use iterative methods
    • State/Action Value Function - the maximal state/action value function over all policies, e.g. $v_*(s) = \max_{\pi} v_{\pi}(s)$
    • Policies - an optimal policy is better than every other policy over all states, given the partial ordering $\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$
      • There always exists a deterministic optimal policy for MDPs (see slides)
      • It is not a composition of bits and pieces of other policies; it is a policy in its own right
      • It may not be unique
  • Policy - defines the behaviour of an agent; mathematically it is a distribution over actions given states, $\pi(a|s) = \mathbb{P}(A_t = a | S_t = s)$
  • Probability State Transition Matrix $\mathcal{P}_{ss'}$ - the probabilities of going from state $s$ to state $s'$
  • Return $G_t$ - the total discounted reward into the future over a specific episode (incorporating discounting)
  • Reward $\mathcal{R}$ - the expected immediate reward for the current state, $\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]$
  • Value Function $v(s)$ - how good it is to be in any state, i.e. the expected return from a given state
    • State Value Function $v_{\pi}(s)$ - the expected return from a state when following a specific policy, $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$
      • i.e. how good is it to be in this state?
    • Action Value Function $q_{\pi}(s, a)$ - the expected return from a state, taking a specific action, and then following a specific policy, $q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]$
      • i.e. how good is it to take a particular action from a state?

Dynamic Programming

  • Synchronous backups - used when evaluating a policy, i.e. finding the value function via iteration on the Bellman equation
  • Greedy Policy Improvement - choose actions with preference towards moving to a state with a better value function (i.e. uses the action value function to prioritise the new policy action)
  • Policy Evaluation and Iteration - alternates evaluation (synchronous backups) and improvement (greedy policy improvement) to converge to the optimal value function and policy.
    • With care, inner loop (evaluation) and outer loop (improvement) can be interspersed


  • Value Iteration - works backwards from a known optimal value at a specific state, $v_*(s')$, to compute the entire optimal value function. No explicit policy needed. Uses the Bellman optimality equations (the ones without a policy). A minimal sketch follows below.
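
A minimal value-iteration sketch over the Bellman optimality backup; the tabular array layout (P[a, s, s'], R[a, s]) is an assumption carried over from the earlier sketch:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s'] transition probabilities, R[a, s] expected rewards (tabular arrays as assumed above)."""
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup: v(s) = max_a [ R^a_s + gamma * sum_s' P^a_ss' v(s') ]
        q = R + gamma * np.einsum('ast,t->as', P, v)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # The greedy policy falls out of the optimal action values.
    return v_new, q.argmax(axis=0)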

Model Free

Prediction

Estimating the value function given a policy from episodic data.


Monte Carlo Methods

  • First/Every Visit Mean Estimation
    • Gather episodes for a policy, then accumulate returns for the first (or every) visit to a state within each episode.
    • Determine the value function by a total, incremental or moving-window mean calculation (see the sketch after the equations below):
\begin{align}
V(S_t) &\leftarrow \sum G_t / N(S_t) \\
&\leftarrow V(S_t) + \frac{1}{N(S_t)} ( G_t - V(S_t)) \\
&\leftarrow V(S_t) + \alpha ( G_t - V(S_t)) \\
\end{align}
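
A minimal every-visit Monte Carlo prediction sketch with the incremental-mean update; the episode format (a list of (S_t, R_{t+1}) pairs) is an assumption for illustration:

from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Every-visit Monte Carlo evaluation. Each episode is a list of (S_t, R_{t+1}) pairs
    generated by the policy being evaluated (an assumed data format)."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step onwards.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # incremental mean update
    return V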

TD

  • Value Function Update - same as Monte Carlo, except that it uses the Bellman equation to estimate the return beyond the next step (a one-step sketch follows below)
\begin{align}
V(S_t) &\leftarrow V(S_t) + \alpha ( G_t - V(S_t)) \\
&\leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \\
\end{align}
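
A minimal tabular TD(0) prediction sketch; the env interface (reset() and step() returning (next_state, reward, done)) is an assumed stand-in, not a specific library:

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): move V(S_t) towards R_{t+1} + gamma * V(S_{t+1}) after every step.
    `env` is assumed to expose reset() -> state and step(action) -> (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V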

...

Tip: TD(lambda) interpolates between TD(0) and Monte Carlo. Monte Carlo is typically noisy, TD(0) is typically not enough; TD(lambda) often helps you find the sweet spot.

  • Online vs Offline - whether you update the value function in the middle of an episode or at the end of an episode
  • Eligibility Trace - keeps track of the frequency and recency of state visits; used to weight the value function update for each state (a minimal backward-view sketch follows below)
  • Forward vs Backward View - didn't really catch this...
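
A minimal backward-view TD(lambda) sketch with accumulating eligibility traces, under the same assumed env interface as the TD(0) sketch above:

from collections import defaultdict

def td_lambda_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda): every state's value moves towards the one-step TD error,
    weighted by its eligibility trace (recency and frequency of visits)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        E = defaultdict(float)           # eligibility traces, reset at the start of each episode
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0              # accumulating trace for the state just visited
            for s in list(E):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam      # decay every trace each step
            state = next_state
    return V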

Control

Optimise the value function (and subsequently find the optimal policy) for an unknown MDP.

...

  • Method - similar to that for dynamic programming, except
    • Uses Q, not V: this solves the problem of not knowing the model, or more specifically the transition dynamics
    • Uses epsilon-greedy updates: to make sure that the policy keeps exploring enough to generate sufficient samples for the next step
  • Epsilon greedy - adds some randomness to the greedy update for more exploratory policy improvement (a minimal selection sketch follows the GLIE notes below)
  • Speedup - update the policy after every episode


  • Greedy in the Limit with Infinite Exploration (GLIE)
    • All state-action pairs are explored infinitely many times
    • The policy converges on a greedy policy
  • Guaranteeing GLIE
    • Evaluation: sample episodes with the current policy and update Q(S,A) towards the return with a count-based 1/N(S,A) step size
    • Improvement: make sure epsilon goes to zero as the update count goes up (e.g. epsilon_k = 1/k, as in the sketch below)
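
A minimal GLIE-style epsilon-greedy selection sketch; the 1/k decay schedule and the dictionary layout of Q are illustrative assumptions:

import random

def epsilon_greedy_action(Q, state, actions, episode_k):
    """Pick a random action with probability epsilon, otherwise the greedy action.
    With epsilon = 1/episode_k the exploration decays to zero, so the policy is greedy in the limit."""
    epsilon = 1.0 / max(1, episode_k)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))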

On-Policy TD (Sarsa)

  • Convergence - guaranteed if policies are GLIE and step sizes satisfy the Robbins-Monro conditions for a sequence
    • Robbins-Monro - $\sum_i \alpha_i = \infty$ and $\sum_i \alpha_i^2 < \infty$
  • Sarsa(lambda) - decaying geometric weight used to do n-step Sarsa (TD style); can use eligibility traces in the same way to weight for or against recently/frequently visited states (a one-step Sarsa update sketch follows below)
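
A minimal one-step Sarsa sketch (on-policy TD control); the env interface and the 1/k epsilon decay are the same assumptions as in the earlier sketches:

import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9):
    """On-policy TD control: update Q(S,A) towards R + gamma * Q(S',A'),
    where A' is chosen by the same epsilon-greedy policy being improved."""
    Q = defaultdict(float)

    def eps_greedy(state, k):
        # GLIE-style epsilon-greedy with epsilon = 1/k
        if random.random() < 1.0 / k:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for k in range(1, n_episodes + 1):
        state = env.reset()
        action = eps_greedy(state, k)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state, k)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q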

Off-Policy Learning

  • Reasons - watch a human, re-use episodes from old policies, learn about the optimal policy while exploring
  • Importance Sampling - not practical for Monte Carlo (too much variance over the many steps in an episode); use it for TD and apply the ratio of the policy probabilities for the single step
  • Target Policy $\pi$ - the policy we are trying to optimise
  • Behaviour Policy $\mu$ - the policy that actually selects actions and generates the experience

Q Learning

  • Behaviour/Target Policies - the behaviour policy is epsilon-greedy (exploration), the target policy is greedy (optimal); see the update sketch below
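
A minimal Q-learning sketch (off-policy TD control): behave epsilon-greedily, but bootstrap the target towards the greedy max; the env interface and 1/k decay schedule are the same assumptions as above:

import random
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap towards the greedy
    (max) action value, i.e. the target policy is greedy with respect to Q."""
    Q = defaultdict(float)
    for k in range(1, n_episodes + 1):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with epsilon = 1/k
            if random.random() < 1.0 / k:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy max over actions at the next state
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q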


Scaling

For both prediction and control there are too many states to store $V(s)$ or $Q(s,a)$ directly as an oracular lookup in a table, e.g. Go is on the order of $10^{170}$ states. Instead, we try to estimate the value function used in the update equations.

\begin{align}
\hat{v}(s, \boldsymbol{w}) &\simeq v_{\pi}(s) \\
\hat{q}(s, a, \boldsymbol{w}) &\simeq q_{\pi}(s,a) \\
\end{align}
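
A minimal sketch of a semi-gradient TD(0) step with a linear value-function approximation; the feature map phi is a placeholder assumption for a problem-specific function:

import numpy as np

def semi_gradient_td0_step(w, phi, state, reward, next_state, done, alpha=0.01, gamma=0.9):
    """One update of the weights of a linear approximation v_hat(s, w) = phi(s) . w.
    `phi` is an assumed, problem-specific function mapping a state to a fixed-length feature vector."""
    v = phi(state) @ w
    v_next = 0.0 if done else phi(next_state) @ w
    td_error = reward + gamma * v_next - v
    # Semi-gradient: the gradient of v_hat with respect to w is just phi(state).
    return w + alpha * td_error * phi(state)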

...

  • Batching Methods
    • Experience Replay - store the transitions generated by previous (limited-perspective) runs, sample from that dataset, and optimise the value-function parameters (e.g. the weights of a linear combination) towards the targets (a minimal replay-buffer sketch follows below).
      • i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
    • Deep Q-Networks - uses experience replay and fixed Q-targets; the algorithm skeleton is in Silver's slides.
      • This is a stable algorithm!
        • Experience replay decorrelates the trajectories
        • Two networks (didn't really catch the why of this: something about bootstrapping towards a frozen target from some time ago rather than the latest, freshest parameters, which otherwise causes spiralling problems)
  • Linear Least Squares Methods
    • Go straight to the solution by setting the expected weight change to zero and solving for the weights directly.
    • Only works for linear approximations
    • Can improve the situation for policy learning
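
A minimal experience-replay buffer sketch; the capacity, batch size and transition tuple layout are illustrative assumptions:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions.
    Sampling uniformly at random decorrelates consecutive steps of a trajectory."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))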


Policy Gradient Approximations

Definitions

  • State Aliasing - when an environment is partially observed, or your features are reduced so that it is impossible to make a deterministic decision to get to a goal
  • Stochastic Policy / State Aliasing - a stochastic policy can get you out of trouble by choosing randomly when there is no better idea of what to do, or when there are conflicting motivations

Background

  • Why Policy Based Reinforcement Learning? - for some problems, it is better or more convenient to learn the policy by wandering around and improving a policy based on rewards without ever actually learning the value functions.
    • Sometimes the policy is much less complex than the value function
      • e.g. atari games - if the ball is here then it's fairly obvious you should move to the left, but much harder to realise that the value of moving left in that situation is worth 1714 points.
    • Better convergence properties - incremental changes in policy result in incremental changes in actions, while incremental changes in value can result in large changes to policy, subseq. actions, subseq. next value.
    • Computing greedy / epsilon-greedy policy from a value function with a large action space (e.g. robot torques) can be super expensive
    • Can learn stochastic policies - as opposed to greedy / epsilon-greedy policies derived from value functions which are deterministic in nature
      • Are there alternatives to greedy / epsilon greedy that are still optimal but not deterministic?
    • Disadvantage: learning can be slower (tradeoff for being smoother) (value function updates make large pushes to an optimal policy)
    • Disadvantage: for some problems complexity in evaluating a policy is worse than updating a value function or deriving a policy from a value function
    • Example: rock paper scissors doesn't have an optimal deterministic strategy - value based learning would fail
    • Example: partially observed environments or incomplete features can result in a suboptimal deterministic strategy - stochastic policies would be better
  • Parameterised Policy - approximate the optimal policy by parameterising it: $\pi_{\theta}(s,a) = \mathbb{P}[a|s, \theta]$
    • Set a differentiable objective function based on this parameter → a gradient to update from!
  • Indicator (one) Functions - a useful way of setting up linear on/off features
    • e.g. $\phi(s, a) = \boldsymbol{1}(\text{wall to the north, move east})$
  • Objective Functions - how to evaluate a policy
    • $J_1$ - use the value at the start state (the expected return over an episode) (episodic environments only!)
    • $\frac{1}{1-\gamma}J_{avV}$ - the average value over the stationary distribution of states
    • $J_{avR}$ - the average reward per time-step
    • Maximise the objective function with respect to theta; lots of methods available
    • What is the stationary distribution of the Markov chain?
    • How do you compute $v_{\pi_{\theta}}(s)$ in a continuing environment?
    • Why does he call it u?
  • Likelihood Ratios - Exploit the following identity. 

...

  • Score Function - is $\nabla_{\theta} \log \pi_{\theta}(s,a)$. Would get used in maximum likelihood learning.
  • Softmax Policy - basically proportional to some exponentiated value; see also Wikipedia on the softmax function for probabilities.
  • Linear Softmax Policy - the exponent is a linear combination of features weighted by $\theta$, e.g. $\pi_{\theta}(s,a) \propto e^{\phi(s,a)^T \theta}$
    • The score function is $\nabla_{\theta} \log \pi_{\theta}(s,a) = \phi(s,a) - \mathbb{E}_{\pi_{\theta}}[\phi(s, \cdot)]$
      • Essentially: score is 'feature - average feature'
    • Where does the latter part of this term come from?
  • Gaussian Policy - used in continuous action spaces where the mean is a linear combination of state features, i.e. $\mu(s) = \phi(s)^T\theta$, and the policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^2)$
    • The score function is $\nabla_{\theta} \log \pi_{\theta}(s,a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}$
      • Essentially: action - mean action (how much more than usual am I doing something)
    • What is $a$ exactly?
  • Policy Gradient Theorem - for any differentiable policy $\pi_{\theta}(s,a)$ and for any of the policy objective functions $J_1$, $\frac{1}{1-\gamma}J_{avV}$, $J_{avR}$, the policy gradient is (note Q is often just sampled from episodes; what if you don't have episodes?):

...