Definitions & Concepts

Concise list of terms and concepts in reinforcement learning.




Other Useful Definitions

  • Active Learning - a form of supervised learning in which the machine may query the supervisor; reinforcement learning is a kind of active learning (its actions alter what it can sense and learn)
  • Principle of Optimality - applies to problems where the optimal solutions of subproblems combine to give the optimal solution of the overall problem


General Taxonomy

  • Agent Types - policy-based, value-based, actor-critic (policy and value), model-free and model-based

  • History - the sequence of observations, actions and rewards, H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
  • Markov - a state is considered Markov if the future can be equivalently determined from the current state alone, i.e. P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t] (can throw away all history up to the present)
  • Model - predicts what the environment will do next given the current state and an action, usually two parts
    • dynamics: predictor of the next agent state given its interaction with the environment
    • rewards: predictor of the next reward received from its environment

  • Policy - defines how an agent will behave; it is a map from state to action and can be deterministic or stochastic (see the interaction sketch after this list)

  • State  - State, Agent State, Environment State
  • Value Function - provides an estimate of how good each state and/or action is
    • e.g. a predictor of future reward given a policy and a state, v_π(s) = E_π[R_{t+1} + γR_{t+2} + ... | S_t = s]
    • This gets complicated to calculate when policies are stochastic
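As a rough illustration of how these pieces fit together, here is a minimal agent-environment loop in Python; the two-state environment, the uniform random policy and the reward values are all made up for this sketch.

```python
import random

# A made-up two-state environment: state 0 is the start, state 1 is the goal.
def step(state, action):
    """Return (next_state, reward) for a toy environment (assumed for illustration)."""
    if state == 0 and action == "right":
        return 1, 1.0   # reaching the goal pays a reward of 1
    return 0, 0.0       # everything else resets to state 0 with no reward

def policy(state):
    """A uniform random (stochastic) policy: maps the current state to an action."""
    return random.choice(["left", "right"])

history = []            # the history H_t: sequence of (observation, action, reward)
state, total_reward = 0, 0.0
for t in range(10):
    action = policy(state)
    next_state, reward = step(state, action)
    history.append((state, action, reward))   # fully observed, so observation = state
    total_reward += reward
    state = next_state

print(history, total_reward)
```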

Concepts

  • Exploration vs Exploitation - explore and learn the environment, then exploit the environment to maximise reward (needs careful balance)
  • Prediction vs Control
    • Prediction - What is the value function v_π for the given policy π?
    • Control - What is the optimal value function v_* or optimal policy π_*?
  • Reinforcement Learning vs Planning
    • Reinforcement Learning (interaction) - environment is unknown (no model), interacts with the environment, improves its policy
    • Planning (thinking) - environment is known (model), no interaction with the environment, computes with the model, improves its policy
  • Stationary Processes - time independence, e.g. MDP states whose dynamics are the same regardless of the time at which they are entered.
  • Stochastic Policies - are sometimes necessary, e.g. rock-paper-scissors requires a random policy, otherwise it gets exploited by the other player

Markov Decision Processes

  • Bellman Equation (MRP) - calculates value iteratively per state: v(s) = R_s + γ Σ_{s'} P_{ss'} v(s') (see the worked example after this list)
  • Bellman Expectation Equations (MDP) - the Bellman equations given actions and subsequently policies (now considering state value and action value functions):
    • v_π(s) = Σ_a π(a|s) q_π(s,a)
    • q_π(s,a) = R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s')
  • Bellman Optimality Equation (MDP) - since a deterministic optimal policy always exists, the Bellman expectation equations reduce to v_*(s) = max_a [ R^a_s + γ Σ_{s'} P^a_{ss'} v_*(s') ] and q_*(s,a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q_*(s',a')
  • Discount γ ∈ [0,1] - how much reward diminishes into the future when calculating the expected return
  • Episode - a specific sequence of states from a Markov Process (Chain)
  • Flatten - you can always flatten an MDP with a specific policy into an MRP.
  • Markov Process ⟨S, P⟩ - a tuple defining a memoryless random system (also known as a Markov Chain)
    • Markov Reward Process ⟨S, P, R, γ⟩ - adds rewards and a discount, from which values can be computed
    • Markov Decision Process ⟨S, A, P, R, γ⟩ - includes actions, which condition the probability transition matrix and reward function
    • Partially Observable Markov Decision Process ⟨S, A, O, P, R, Z, γ⟩ - has hidden states and an observation function Z
  • Optimality
    • Bellman Equations - similar to the policy-based Bellman equations but with v_* and q_* (maximising over actions rather than averaging over the policy)
      • Solving for v_* and q_* allows you to compute the optimal policy
      • But they are nonlinear, and in general, no closed form solution - use iterative methods
    • State/Action Value Function - is the maximal state/action value function over all policies, e.g. v_*(s) = max_π v_π(s) and q_*(s,a) = max_π q_π(s,a)
    • Policies - π is better than every other policy over all states given the partial ordering π ≥ π' if v_π(s) ≥ v_π'(s) for all s
      • There always exists a deterministic optimal policy for MDPs (see slides)
      • It is not a composition of bits and pieces of other policies; it is a policy in its own right
      • May not be unique
  • Policy π - defines the behaviour of an agent; mathematically it is a distribution over actions given states, π(a|s) = P[A_t = a | S_t = s]
  • Probability State Transition Matrix P - probabilities of going from a state s to a successor state s', P_{ss'} = P[S_{t+1} = s' | S_t = s]
  • Return G_t - the reward into the future over a specific episode (incorporating discounting), G_t = R_{t+1} + γR_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
  • Reward R_s - the expected immediate reward for the current state, R_s = E[R_{t+1} | S_t = s]
  • Value Function v(s) - how good it is to be in any state, i.e. the expected return from a given state
    • State Value Function v_π(s) = E_π[G_t | S_t = s] - expected return from a state following a specific policy π
      • i.e. how good is it to be in this state?
    • Action Value Function q_π(s,a) = E_π[G_t | S_t = s, A_t = a] - expected return from a state, taking a specific action, and then following a specific policy π
      • i.e. how good is it to take a particular action from a state?
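Because the MRP Bellman equation is linear, a small MRP can be solved for v directly; here is a minimal sketch in Python with a made-up 3-state process (the transition matrix, rewards and discount are invented for illustration).

```python
import numpy as np

# A made-up 3-state Markov Reward Process <S, P, R, gamma>.
P = np.array([[0.5, 0.5, 0.0],    # probability state transition matrix P_ss'
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])     # expected immediate reward R_s
gamma = 0.9

# Bellman equation for the MRP: v = R + gamma * P v, which is linear in v,
# so for small state spaces it can be solved in closed form.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```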

Dynamic Programming

  • Synchronous backups - used when evaluating a policy, i.e. finding the value function via iteration on the Bellman equation
  • Greedy Policy Improvement - choose actions with preference towards moving to a state with a better value function (i.e. uses the action value function to prioritise the new policy action)
  • Policy Iteration - alternates policy evaluation (synchronous backups) and policy improvement (greedy policy improvement) to converge to the optimal value function and policy.
    • With care, inner loop (evaluation) and outer loop (improvement) can be interspersed

  • Value Iteration - process of working backwards from a known optimal value at a specific state, v_*(s'), to work out the entire optimal value function. No policy needed. Uses the Bellman optimality equation (the one without a policy); see the sketch below.
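A minimal value-iteration sketch in Python, assuming a tiny made-up 2-state, 2-action MDP (the transition and reward numbers are invented for illustration):

```python
import numpy as np

# Made-up MDP: P[a][s][s'] is the transition probability, R[a][s] the expected reward.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.0, 1.0]])}
R = {0: np.array([0.0, 1.0]), 1: np.array([0.5, 2.0])}
gamma, n_states = 0.9, 2

# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a [ R_s^a + gamma * sum_s' P_ss'^a v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    v = np.max([R[a] + gamma * P[a] @ v for a in (0, 1)], axis=0)

# A greedy policy can then be read off from the optimal values.
policy = np.argmax([R[a] + gamma * P[a] @ v for a in (0, 1)], axis=0)
print(v, policy)
```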

Model Free

Prediction

Estimating the value function for a given policy from episodic data.

Monte Carlo Methods

  • First/Every Visit Mean Estimation (sketched below)
    • Gather episodes for a policy, then accumulate returns the first (or every) time you visit a state in an episode.
    • Determine the value function by a total, incremental or moving-window mean calculation
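A minimal first-visit Monte Carlo prediction sketch with an incremental mean; the episode format (a list of (state, reward) pairs per episode) is an assumption for this sketch:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo prediction with an incremental mean update."""
    V, N = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, first_returns = 0.0, {}
        # Work backwards so G accumulates the discounted return from each step;
        # overwriting on the way back leaves the return from the FIRST visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_returns[state] = G
        for state, G in first_returns.items():
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # incremental mean
    return dict(V)

print(first_visit_mc([[("a", 0.0), ("b", 1.0)], [("a", 1.0)]]))
```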

TD

  • Value Function Update - same as Monte Carlo except that it uses the Bellman equation to estimate the return beyond the next step (see the sketch at the end of this section)
  • TD Target - the estimated return R_{t+1} + γV(S_{t+1}); it's a biased estimate
  • TD(0) - as above, with one step
  • TD(n-step) - can be extended to multiple steps; the limit of this is Monte Carlo
  • TD(lambda) - a decaying geometric weight is used to utilise all n-step estimates at once (the backward view can be computed as cheaply as TD(0))

TIP TD(lambda) interpolates between TD(0) and Monte Carlo. Monte Carlo is typically noisy (high variance), TD(0) is biased and only looks one step ahead; TD(lambda) often helps you find the sweet spot.

  • Online vs Offline - whether you update the value function in the middle of an episode or at the end of an episode
  • Eligibility Trace - keep track of the frequency and recency of a state visit, use this to determine the value function update step
  • Forward vs Backward View - the forward view updates towards the λ-return (a weighted mix of all n-step returns, so it can only be computed at the end of an episode); the backward view uses eligibility traces to make an equivalent update online, step by step
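A minimal TD(0) prediction sketch; the transition format (state, reward, next_state) and the toy data are assumptions for this sketch:

```python
from collections import defaultdict

def td0(episodes, alpha=0.1, gamma=0.9):
    """TD(0) prediction: update V towards the TD target R + gamma * V(S')."""
    V = defaultdict(float)   # unseen states (including terminals) default to 0
    for episode in episodes:
        for state, reward, next_state in episode:
            td_target = reward + gamma * V[next_state]    # biased estimate of the return
            V[state] += alpha * (td_target - V[state])    # step towards the target
    return dict(V)

print(td0([[("a", 0.0, "b"), ("b", 1.0, "terminal")]] * 50))
```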

Control

Optimise the value function (and subsequently find the optimal policy) for an unknown MDP.

  • On Policy - 'learning while on the job'
  • Off Policy - 'look over someone else's shoulder', (e.g. robot looking at a human)

On-Policy Monte Carlo

  • Method - similar to that for dynamic programming except
    • Uses Q, not V - this solves the problem of not knowing the model, or more specifically the transition dynamics
    • Uses epsilon-greedy updates - to make sure the policy keeps exploring enough to generate sufficient samples for the next step
  • Epsilon-greedy - adds some randomness to the greedy update for more exploratory policy improvement (see the sketch after this list)
  • Speedup - update the policy after every episode

  • Greedy in the Limit with Infinite Exploration (GLIE)
    • All state-action pairs are explored infinitely many times
    • The policy converges on a greedy policy
  • Guaranteeing GLIE
    • Evaluation: sample episodes, count state-action visits N(S_t, A_t) and update Q with an incremental mean, Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t)) (G_t - Q(S_t, A_t))
    • Improvement: make sure epsilon goes to zero as the episode count goes up, e.g. ε_k = 1/k
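A minimal sketch of the epsilon-greedy choice together with a GLIE schedule (ε_k = 1/k); the Q-table format and the state/action names are assumptions for this sketch:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore uniformly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# GLIE schedule: epsilon decays towards zero as the episode count k grows,
# while every state-action pair still gets explored infinitely often.
Q, actions = {}, ["left", "right"]
for k in range(1, 101):
    epsilon = 1.0 / k
    action = epsilon_greedy(Q, "some_state", actions, epsilon)
    # ... run the episode with this policy, then update Q with an incremental mean ...
```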

On-Policy TD (Sarsa)

  • Convergence - guaranteed if policies are GLIE and step sizes satisfy the Robbins-Monro conditions for the sequence of step sizes α_t
    • Robbins-Monro - Σ_t α_t = ∞ and Σ_t α_t² < ∞
  • Sarsa(lambda) - a decaying geometric weight is used to do n-step Sarsa (TD-style); eligibility traces can be used in the same way to weight for or against recently/frequently visited states (a one-step Sarsa backup is sketched below)
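A minimal one-step Sarsa backup; the Q-table format and the toy transition are assumptions for this sketch:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy Sarsa backup: the target uses the action a_next actually taken
    next under the current (e.g. epsilon-greedy) policy."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

Q = {}
sarsa_update(Q, s="a", a="right", r=1.0, s_next="b", a_next="left")
print(Q)
```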

Off-Policy Learning

  • Reasons - watch a human, re-use episodes from old policies, learn about optimal while exploring
  • Importance Sampling - not practical for Monte Carlo: the product of ratios over a whole episode has too much variance (noise); for TD, apply the ratio of policy probabilities for just the one step
  • Target Policy π - the policy we are trying to learn about / optimise
  • Behaviour Policy μ - the policy that actually generates the behaviour (the episodes we learn from)

Q Learning

  • Behaviour/Target Policies - the behaviour policy is epsilon-greedy w.r.t. Q (exploration), the target policy is greedy w.r.t. Q (optimal); the update bootstraps from the greedy action, Q(S,A) ← Q(S,A) + α(R + γ max_{a'} Q(S',a') - Q(S,A)) (sketched below)
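A minimal one-step Q-learning backup for contrast with Sarsa; the Q-table format and the toy transition are assumptions for this sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy Q-learning backup: behaviour may be epsilon-greedy, but the
    target bootstraps from the greedy (max) action, i.e. the target policy is greedy."""
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

Q, actions = {}, ["left", "right"]
q_learning_update(Q, s="a", a="right", r=1.0, s_next="b", actions=actions)
print(Q)
```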

Scaling

For both prediction and control there are too many states to store v(s) or q(s,a) directly as an oracular lookup table, e.g. Go has on the order of 10^170 states. Instead, we estimate the value function with a parameterised approximator, v̂(s, w) ≈ v_π(s), and use that approximation in the update equations.

  • Generalise from the seen to the unseen
  • Update parameters w using Monte Carlo or TD Learning

Which function approximator? Focus on methods with a gradient, e.g. linear combinations and neural networks.

Using linear combinations, we introduce a feature vector x(S) that 'represents' the entirety of the state, e.g. distance from the wall. This is essentially putting 'our knowledge' into the problem to reduce the dimensionality of the state (something of a black art). The value function is then approximated by v̂(S, w) = x(S)^T w = Σ_j x_j(S) w_j.
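A minimal semi-gradient TD(0) sketch with a linear approximator; the transition format, feature map and toy data are assumptions for this sketch:

```python
import numpy as np

def semi_gradient_td0(transitions, x, n_features, alpha=0.01, gamma=0.9):
    """Semi-gradient TD(0) with v_hat(s, w) = x(s) . w."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else x(s_next) @ w
        td_error = r + gamma * v_next - x(s) @ w
        w += alpha * td_error * x(s)     # gradient of v_hat w.r.t. w is just x(s)
    return w

# Made-up hand-crafted features: a bias term plus the raw state index.
def features(s):
    return np.array([1.0, float(s)])

print(semi_gradient_td0([(0, 0.0, 1, False), (1, 1.0, 2, True)] * 100, features, 2))
```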

Convergence of the prediction algorithms - the bias introduced by bootstrapping can theoretically cause it to diverge (even if in practice it works). There are some recent approaches that can fix the TD problems (Gradient TD / Emphatic TD).

Convergence of the control algorithms

  • Batching Methods
    • Experience Replay - store the transitions generated under previous (limited-perspective) policies, sample from that store, and optimise the value function parameters (e.g. the weights of the linear combination) towards the targets (see the buffer sketch after this list).
      • i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
    • Deep Q-Networks - uses experience replay and fixed Q-targets, algorithm skeleton in Silver's slides.
      • This is a stable algorithm!
        • Experience replay decorrelates the trajectories
        • Fixed Q-targets: two networks are used; updates bootstrap towards a frozen target network from some time ago rather than the latest, freshest parameters, because chasing a constantly moving target can cause the estimates to spiral/diverge
  • Linear Least Squares Methods
    • Go straight to the solution by setting the expected weight change to zero and solving for the weights directly.
    • Only works for linear problems
    • Can improve the situation for policy learning
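A minimal experience replay buffer sketch (the transition fields and capacity are assumptions for this sketch); sampling uniformly from past experience decorrelates the trajectories used for updates:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions from past behaviour and sample them uniformly."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
for t in range(100):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
print(len(buffer.sample(32)))
```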

Policy Gradient Approximations

Definitions

  • State Aliasing - when the environment is partially observed, or the features are reduced, so that two distinct states look the same and a deterministic decision cannot reliably reach the goal
  • Stochastic Policy / State Aliasing - a stochastic policy can get you out of trouble by choosing randomly when there is no better idea of what to do, or when there are conflicting motivations

Background

  • Why Policy Based Reinforcement Learning? - for some problems, it is better or more convenient to learn the policy by wandering around and improving a policy based on rewards without ever actually learning the value functions.
    • Sometimes the policy is much less complex than the value function
      • e.g. atari games - if the ball is here then it's fairly obvious you should move to the left, but much harder to realise that the value of moving left in that situation is worth 1714 points.
    • Better convergence properties - incremental changes in policy result in incremental changes in actions, while incremental changes in value can result in large changes to policy, subseq. actions, subseq. next value.
    • Computing greedy / epsilon-greedy policy from a value function with a large action space (e.g. robot torques) can be super expensive
    • Can learn stochastic policies - as opposed to greedy / epsilon-greedy policies derived from value functions which are deterministic in nature
      • Are there alternatives to greedy / epsilon greedy that are still optimal but not deterministic?
    • Disadvantage: learning can be slower (tradeoff for being smoother) (value function updates make large pushes to an optimal policy)
    • Disadvantage: for some problems complexity in evaluating a policy is worse than updating a value function or deriving a policy from a value function
    • Example: rock paper scissors doesn't have an optimal deterministic strategy - value based learning would fail
    • Example: partially observed environments or incomplete features can result in a suboptimal deterministic strategy - stochastic policies would be better
  • Parameterised Policy - approximate the optimal policy by parameterising it: π_θ(s, a) = P[a | s, θ]
    • Set a differentiable objective function based on this parameter → gradient to update from!
  • One (Indicator) Distributions - indicator (one/zero) functions are a useful way of setting up linear on/off features
    • e.g. x(s, a) = 1(some condition on s holds and a = some specific action)
  • Objective Functions - how to evaluate a policy
    • J_1(θ) = V^{π_θ}(s_1) - use the value at the start state (expected return over an episode) (episodic environments only!)
    • J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s) - the average value, weighted by the stationary distribution d^{π_θ}
    • J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s,a) R^a_s - the average reward per time step
    • Maximise the objective function against theta, lots of methods available
    • What is the stationary distribution of the Markov chain? (the long-run proportion of time spent in each state under π_θ)
    • How do you compute V^{π_θ}(s) in a continuing environment?
    • Why does he call it u?
  • Likelihood Ratios - exploit the identity ∇_θ π_θ(s,a) = π_θ(s,a) ∇_θ log π_θ(s,a)
  • Score Function - is ∇_θ log π_θ(s,a); the same quantity gets used in maximum likelihood learning.
  • Softmax Policy - basically proportional to some exponentiated value, see also wikipedia about softmax functions for probabilities.
  • Linear Softmax Policy - actions weighted by an exponentiated linear combination of features, e.g. π_θ(s,a) ∝ e^{x(s,a)^T θ}
    • There is a partition function encapsulated in the proportional part here, see the wikipedia link for a similar example. 
    • The score function is ∇_θ log π_θ(s,a) = x(s,a) - E_{π_θ}[x(s,·)]
      • Essentially: Score is 'feature - average feature'
      • The second term comes from the gradient due to the proportional part.
  • Gaussian Policy - used in continuous action spaces where the mean is a linear combination of state features, μ(s) = x(s)^T θ, and the policy is Gaussian, i.e. a ~ N(μ(s), σ²)
    • The score function is ∇_θ log π_θ(s,a) = (a - μ(s)) x(s) / σ²
      • Essentially: action - mean action (how much more than usual am I doing something)
      • Remember a is part of a continuous action space, hence a normal distribution on a random variable in the reals makes sense
  • Policy Gradient Theorem - for any differentiable policy π_θ and for any of the policy objective functions J_1, J_avV, J_avR, the policy gradient is (note Q is often just sampled from episodes; what if you don't have episodes?):

    ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) · Q^{π_θ}(s, a) ]
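A minimal Monte-Carlo policy gradient (REINFORCE) sketch with a linear softmax policy, where the sampled return G_t stands in for Q; the episode format, feature map and toy numbers are assumptions for this sketch:

```python
import numpy as np

def softmax_probs(theta, x, s, actions):
    """Linear softmax policy: pi(a|s) proportional to exp(x(s,a) . theta)."""
    prefs = np.array([x(s, a) @ theta for a in actions])
    prefs -= prefs.max()                      # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, x, actions, alpha=0.01, gamma=0.99):
    """Per-step update theta <- theta + alpha * grad log pi(a|s) * G_t."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                     # sampled return from this step
        probs = softmax_probs(theta, x, s, actions)
        expected_x = sum(p * x(s, b) for p, b in zip(probs, actions))
        score = x(s, a) - expected_x          # 'feature minus average feature'
        theta += alpha * score * G
    return theta

# Made-up one-hot features over a tiny 2-state, 2-action problem.
actions = [0, 1]
def x(s, a):
    return np.eye(4)[s * 2 + a]

theta = reinforce_update(np.zeros(4), [(0, 1, 1.0), (1, 0, 0.0)], x, actions)
print(theta)
```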
  • Actor-Critic Learning - uses a critic to estimate the action value function, i.e. Q_w(s,a) ≈ Q^{π_θ}(s,a) (a one-step sketch follows below)
    • Critic - updates action value function parameters w, use old techniques (e.g. TD(0))
      • When using this we are introducing bias!
    • Actor - updates policy parameters θ in the direction suggested by the critic
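A minimal one-step actor-critic sketch with a linear critic Q_w(s,a) = x(s,a)·w and a linear softmax actor; the feature map, learning rates and toy transition are assumptions for this sketch:

```python
import numpy as np

def actor_critic_step(theta, w, x, s, a, r, s_next, a_next, actions,
                      alpha=0.01, beta=0.1, gamma=0.99):
    """One actor-critic step: the critic does a TD-style update of w, and the
    actor moves theta in the direction score * Q_w suggested by the critic."""
    q, q_next = x(s, a) @ w, x(s_next, a_next) @ w
    td_error = r + gamma * q_next - q
    w = w + beta * td_error * x(s, a)                 # critic update
    prefs = np.array([x(s, b) @ theta for b in actions])
    probs = np.exp(prefs - prefs.max())
    probs = probs / probs.sum()
    score = x(s, a) - sum(p * x(s, b) for p, b in zip(probs, actions))
    theta = theta + alpha * score * q                 # actor update
    return theta, w

# Made-up one-hot features over a tiny 2-state, 2-action problem.
def x(s, a):
    return np.eye(4)[s * 2 + a]

theta, w = actor_critic_step(np.zeros(4), np.zeros(4), x,
                             s=0, a=1, r=1.0, s_next=1, a_next=0, actions=[0, 1])
print(theta, w)
```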

TIP No greedy or epsilon-greedy any more: we are improving the policy directly via gradients. Note that with greedy or epsilon-greedy we aren't directly learning the policy; it's an implicit result of learning the action-value function.

  • Variants / Tricks
    • Baseline - subtract a baseline (e.g. the state value function) from Q to reduce variance in the updates (e.g. caused by a 1,000,000 reward vs a 1,000 reward in a different sample) without changing the expected gradient
    • Advantage Functions - the critic estimates both the state and action values and uses the advantage A(s,a) = Q(s,a) - V(s)
    • Critics at different time scales - estimating V for many lambda...
    • Alternative Policy Gradient Directions
  • Pain Point - with policy gradient methods, the noise (from sampling) hurts you more and more as your estimates improve (i.e. narrow down).
  • Summary: