
A concise list of terms and concepts in reinforcement learning.




Other Useful Definitions

  • Active Learning - a form of supervised learning in which the machine may query the supervisor; reinforcement learning is a kind of active learning (its actions alter what it can sense and learn)
  • Principle of Optimality - applies to problems where the optimal solutions of subproblems can be combined to give the optimal solution of the overall problem


General Taxonomy

  • Agent Types - policy-based, value-based, actor-critic (policy and value), model-free and model-based

  • History - the sequence of observations, actions and rewards, $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
  • Markov - a state is Markov if the future can be determined from the current state alone, i.e. $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$ (all history up to the present can be thrown away)
  • Model - predicts what the environment will do next given the current state and an action, usually in two parts
    • dynamics: predicts the next agent state given its interaction with the environment
    • rewards: predicts the next reward received from its environment

  • Policy - defines how an agent will behave; it is a map from state to action and can be deterministic ($a = \pi(s)$) or probabilistic ($\pi(a|s)$)

  • State - distinguish the environment state $S^e_t$ (the environment's internal representation) from the agent state $S^a_t$ (whatever information the agent uses to pick its next action)
  • Value Function - provides an estimate of how good each state and/or action is
    • e.g. a predictor of future reward given a policy and a state, $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$
    • This gets more complicated to calculate when policies are probabilistic

Concepts

  • Exploration vs Exploitation - exploration gathers more information about the environment, exploitation uses what is already known to maximise reward; the two need careful balancing
  • Prediction vs Control
    • Prediction - What is the value function for the given policy?
    • Control - What is the optimal value function $v_*$, or the optimal policy $\pi_*$?
  • Reinforcement Learning vs Planning
    • Reinforcement Learning (interaction) - environment is unknown (no model), interacts with the environment, improves its policy
    • Planning (thinking) - environment is known (model), no interaction with the environment, computes with the model, improves its policy
  • Stationary Processes - time independence, e.g. MDP states whose dynamics and rewards are the same regardless of the time at which they are entered.
  • Stochastic Policies - are sometimes necessary, e.g. rock-paper-scissors requires a random policy, lest it get exploited by the other player

Markov Decision Processes

  • Bellman Equation (MRP) - expresses the value of a state as the immediate reward plus the discounted value of the successor state, $v(s) = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'} v(s')$; can be solved directly (matrix inversion) for small problems or iteratively
  • Bellman Expectation Equations (MDP) - the Bellman equations given actions and subsequently policies, now relating the state-value and action-value functions $v_\pi$ and $q_\pi$ (see the equation block after this list)
  • Bellman Optimality Equation (MDP) - since a deterministic optimal policy always exists, the Bellman expectation equations reduce to a max over actions instead of an expectation over the policy (also in the equation block below)
  • Discount $\gamma \in [0, 1]$ - how much reward diminishes into the future when calculating the expected return
  • Episode - a specific sampled sequence of states from a Markov Process (Chain)
  • Flatten - you can always flatten an MDP with a fixed policy into an MRP.
  • Markov Process $\langle \mathcal{S}, \mathcal{P} \rangle$ - a tuple defining a memoryless random process (also known as a Markov Chain)
    • Markov Reward Process $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ - adds rewards and a discount, from which values can be computed
    • Markov Decision Process $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ - includes actions, which condition the probability transition matrix and reward function
    • Partially Observable Markov Decision Process $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle$ - has hidden states, a set of observations $\mathcal{O}$ and an observation function $\mathcal{Z}$
  • Optimality
    • Bellman Equations - similar to the policy-based Bellman equations but with $v_*$ and $q_*$ (and a max over actions)
      • Solving for $v_*$ and $q_*$ allows you to compute the optimal policy
      • But they are nonlinear and, in general, have no closed form solution - use iterative methods
    • State/Action Value Function - the optimal state/action value function is the maximum over all policies, e.g. $v_*(s) = \max_\pi v_\pi(s)$, $q_*(s, a) = \max_\pi q_\pi(s, a)$
    • Policies - an optimal policy is better than or equal to every other policy over all states, given the partial ordering $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
      • There always exists a deterministic optimal policy for MDPs (see slides)
      • It is not a composition of bits and pieces of other policies; it is a policy in its own right
      • It may not be unique
  • Policy $\pi$ - defines the behaviour of an agent; mathematically it is a distribution over actions given states, $\pi(a|s) = \mathbb{P}[A_t = a \mid S_t = s]$
  • State Transition Probability Matrix $\mathcal{P}$ - probabilities of going from state $s$ to successor state $s'$, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
  • Return $G_t$ - the discounted reward into the future over a specific episode, $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$
  • Reward $\mathcal{R}_s$ - the expected immediate reward for the current state, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • Value Function $v(s)$ - how good it is to be in any state, i.e. the expected return from a given state, $v(s) = \mathbb{E}[G_t \mid S_t = s]$
    • State Value Function $v_\pi(s)$ - expected return from a state when following a specific policy $\pi$, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
      • i.e. how good is it to be in this state?
    • Action Value Function $q_\pi(s, a)$ - expected return from a state, taking a specific action, and then following a specific policy $\pi$, $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
      • i.e. how good is it to take a particular action from a state?
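
For reference, the Bellman expectation and optimality equations referred to above, in the usual MDP notation (with action-conditioned transitions $\mathcal{P}^a_{ss'}$ and expected rewards $\mathcal{R}^a_s$):

    v_\pi(s)   = \sum_a \pi(a|s) \Big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_\pi(s') \Big)
    q_\pi(s,a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \sum_{a'} \pi(a'|s') \, q_\pi(s',a')
    v_*(s)     = \max_a \Big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_*(s') \Big)
    q_*(s,a)   = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s',a')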

Dynamic Programming

  • Synchronous Backups - used when evaluating a policy, i.e. finding its value function by iterating the Bellman expectation equation, backing up every state in each sweep
  • Greedy Policy Improvement - choose actions with preference towards moving to a state with a better value function (i.e. uses the action value function to prioritise the new policy's action)
  • Policy Evaluation and Improvement (Policy Iteration) - alternates evaluation (synchronous backups) and improvement (greedy policy improvement) to converge to the optimal value function and policy.
    • With care, inner loop (evaluation) and outer loop (improvement) can be interspersed

  • Value Iteration - works backwards from known optimal values (e.g. at terminal/goal states) to compute the entire optimal value function by repeatedly applying the Bellman optimality equation (the one without a policy). No explicit policy is needed during the iteration (see the sketch below).
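
A minimal value iteration sketch in Python, assuming the model is given as a dictionary P mapping each state and action to a list of (probability, next state, reward) outcomes; the toy two-state MDP, names and constants are illustrative assumptions, not taken from the notes.

    # Value iteration (tabular, model known): repeatedly apply the Bellman
    # optimality backup until the value function stops changing.
    GAMMA = 0.9
    P = {  # hypothetical toy MDP: {state: {action: [(prob, next_state, reward), ...]}}
        "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
        "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.0)]},
    }

    def value_iteration(P, gamma=GAMMA, theta=1e-8):
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                # max over actions of the expected one-step return plus discounted value
                v_new = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s]
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

    print(value_iteration(P))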

Model Free

Prediction

Estimating the value function given a policy from episodic data.

Monte Carlo Methods

  • First/Every Visit Mean Estimation
    • Gather episodes under a policy, then accumulate the return $G_t$ the first time (first-visit) or every time (every-visit) a state is visited in the episode.
    • Determine the value function by a total, incremental or moving-window mean calculation, e.g. incrementally $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))$ (a short sketch follows this list)
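
A minimal sketch of first-visit Monte Carlo prediction with an incremental mean, assuming each episode is a list of (state, reward) pairs with the reward received on leaving the state; the episode format and sample data are assumptions for illustration.

    # First-visit Monte Carlo prediction with incremental mean updates.
    from collections import defaultdict

    def mc_first_visit(episodes, gamma=1.0):
        V = defaultdict(float)   # value estimates
        N = defaultdict(int)     # visit counts
        for episode in episodes:
            # Compute the return G_t for every time step by scanning backwards.
            G, returns = 0.0, []
            for state, reward in reversed(episode):
                G = reward + gamma * G
                returns.append((state, G))
            returns.reverse()
            seen = set()
            for state, G in returns:
                if state in seen:            # first-visit: only the first occurrence counts
                    continue
                seen.add(state)
                N[state] += 1
                V[state] += (G - V[state]) / N[state]   # incremental mean
        return dict(V)

    episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)], [("B", 1.0)]]
    print(mc_first_visit(episodes))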

TD

  • Value Function Update - same as Monte Carlo except that it uses the Bellman equation to estimate the return beyond the next step: $V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
  • TD Target - the estimated return $R_{t+1} + \gamma V(S_{t+1})$; it is a biased estimate
  • TD(0) - as above, with one step
  • TD(n-step) - can be extended to multiple steps; the limit of this is Monte Carlo
  • TD(lambda) - a decaying geometric weight is used to combine all n-step estimates at once; with the backward view this can be computed online at a cost similar to TD(0)

TIP TD(lambda) interpolates between TD(0) and Monte Carlo. Monte Carlo targets are typically noisy (high variance, no bias), TD(0) targets are biased and look only one step ahead; TD(lambda) often helps you find the sweet spot.

  • Online vs Offline - whether you update the value function in the middle of an episode or only at the end of an episode
  • Eligibility Trace - keeps track of the frequency and recency of state visits; this is used to weight the value function update for each state
  • Forward vs Backward View - the forward view defines TD(lambda) by looking ahead and geometrically weighting the n-step returns (theoretical, needs the full episode); the backward view uses eligibility traces to apply equivalent updates online, one step at a time (a backward-view sketch follows this list)
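
A minimal sketch of the backward-view TD(lambda) update with eligibility traces (setting lam=0 recovers plain TD(0)); the transition format (state, reward, next_state) and the step size are assumptions for illustration.

    # Backward-view TD(lambda) prediction with eligibility traces.
    from collections import defaultdict

    def td_lambda(episodes, alpha=0.1, gamma=1.0, lam=0.8):
        V = defaultdict(float)
        for episode in episodes:
            E = defaultdict(float)                        # eligibility traces
            for state, reward, next_state in episode:
                v_next = V[next_state] if next_state is not None else 0.0
                td_error = reward + gamma * v_next - V[state]   # TD target minus current estimate
                E[state] += 1.0                           # bump the trace for the visited state
                for s in list(E):
                    V[s] += alpha * td_error * E[s]       # credit states by recency/frequency
                    E[s] *= gamma * lam                   # decay all traces
        return dict(V)

    episodes = [[("A", 0.0, "B"), ("B", 1.0, None)]]
    print(td_lambda(episodes))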

Control

Optimise the value function (and subsequently find the optimal policy) for an unknown MDP.

  • On Policy - 'learning while on the job'
  • Off Policy - 'look over someone else's shoulder', (e.g. robot looking at a human)

On-Policy Monte Carlo

  • Method - similar to that for dynamic programming except
    • Uses Q, not V - this sidesteps not knowing the model (more specifically, the transition dynamics) when improving the policy
    • Uses epsilon-greedy updates - to make sure the policy keeps exploring enough to generate sufficient samples for the next step
  • Epsilon-greedy - adds some randomness to the greedy update for more exploratory policy improvement
  • Speedup - update the policy after every episode

  • Greedy in the Limit with Infinite Exploration (GLIE)
    • All state-action pairs are explored infinitely many times
    • The policy converges on a greedy policy
  • Guaranteeing GLIE (see the sketch after this list)
    • Evaluation: incremental Monte Carlo updates using running counts, $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))$
    • Improvement: make sure epsilon goes to zero as the episode count goes up, e.g. $\epsilon = 1/k$ for episode $k$
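
A small sketch of the two GLIE ingredients: epsilon-greedy action selection with epsilon = 1/k, and incremental Monte Carlo evaluation of Q using running counts. Q and N are assumed to be plain dictionaries keyed by (state, action); the environment loop is omitted.

    import random

    def epsilon_greedy(Q, state, actions, episode_k):
        eps = 1.0 / episode_k                                       # GLIE: epsilon -> 0 as k grows
        if random.random() < eps:
            return random.choice(actions)                           # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit

    def glie_mc_update(Q, N, visits):
        # visits: list of ((state, action), return G) pairs gathered from one episode.
        for (s, a), G in visits:
            N[(s, a)] = N.get((s, a), 0) + 1
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + (G - q) / N[(s, a)]                     # incremental mean towards G

    Q, N = {}, {}
    glie_mc_update(Q, N, [(("s0", "right"), 1.0)])
    print(epsilon_greedy(Q, "s0", ["left", "right"], episode_k=2))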

On-Policy TD (Sarsa)

  • Convergence - guaranteed if policies are GLIE and the step sizes satisfy the Robbins-Monro properties for a sequence
    • Robbins-Monro - $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
  • Sarsa(lambda) - a decaying geometric weight used to combine n-step Sarsa returns (TD style); eligibility traces can be used in the same way to credit recently/frequently visited state-action pairs (a one-step Sarsa update sketch follows this list)
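
A one-step Sarsa(0) update as a sketch, assuming a tabular Q stored in a dictionary; the state and action names are placeholders.

    # Sarsa(0): Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)).
    # On-policy: A' is the action actually chosen in S' by the current policy.
    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        q = Q.get((s, a), 0.0)
        q_next = Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * q_next - q)

    Q = {}
    sarsa_update(Q, "s0", "right", 1.0, "s1", "left")
    print(Q)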

Off-Policy Learning

  • Reasons - learn by watching a human, re-use episodes from old policies, learn about the optimal policy while following an exploratory policy
  • Importance Sampling - not practical for Monte Carlo: the product of ratios over a whole episode has far too much variance (noise); use it for TD instead, applying a single ratio of the policy probabilities to the one-step target
  • Target Policy $\pi$ - the policy we are trying to evaluate or optimise
  • Behaviour Policy $\mu$ - the policy actually used to select actions and generate the experience

Q Learning

  • Behaviour/Target Policies - the behaviour policy is epsilon-greedy with respect to Q (exploration), while the target policy is greedy with respect to Q (optimal); the update bootstraps towards $\max_{a'} Q(S', a')$ (a one-step update sketch follows)
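
A one-step Q-learning update as a sketch, assuming a tabular Q in a dictionary and that the list of actions available in the next state is passed in; names are placeholders.

    # Q-learning: bootstrap towards the greedy (target) policy, max_a' Q(S', a'),
    # while actions can be selected by an epsilon-greedy behaviour policy.
    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        q = Q.get((s, a), 0.0)
        q_next_max = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
        Q[(s, a)] = q + alpha * (r + gamma * q_next_max - q)

    Q = {}
    q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
    print(Q)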

Scaling

For both prediction and control there can be far too many states to store $V(s)$ or $Q(s, a)$ directly as a lookup table, e.g. Go has on the order of $10^{170}$ states. Instead, we estimate the value function with a parameterised approximator and use that approximation in the update equations.

  • Generalise from the seen to the unseen
  • Update parameters w using Monte Carlo or TD Learning

Which function approximator? Focus on methods with a gradient, e.g. linear combinations and neural networks.

Using linear combinations, we introduce a feature vector x(S) that 'represents' the entirety of the state, e.g. distance from the wall. This is essentially putting 'our knowledge' into the problem to reduce the dimensionality of the state (something of a black art). The value function is then approximated by $\hat{v}(S, w) = x(S)^\top w = \sum_j x_j(S) \, w_j$.
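
A sketch of a semi-gradient TD(0) update with a linear value function $\hat{v}(S, w) = x(S)^\top w$; the feature function below is a hypothetical stand-in for hand-crafted features like 'distance from the wall'.

    import numpy as np

    def features(state):
        # Hypothetical 3-dimensional feature vector x(S).
        return np.array([1.0, float(state), float(state) ** 2])

    def linear_td0_update(w, s, r, s_next, alpha=0.01, gamma=0.99, terminal=False):
        x = features(s)
        v = x @ w
        v_next = 0.0 if terminal else features(s_next) @ w
        td_error = r + gamma * v_next - v
        return w + alpha * td_error * x        # gradient of x(S).w with respect to w is x(S)

    w = np.zeros(3)
    w = linear_td0_update(w, s=1, r=1.0, s_next=2)
    print(w)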

Convergence of the prediction algorithms - the bias introduced by bootstrapping can theoretically cause TD with function approximation to diverge (even if in practice it usually works). There are some recent approaches that can fix the TD problems (Gradient TD / Emphatic TD).

Convergence of the control algorithms

  • Batching Methods
    • Experience Replay - store the transitions generated while following the (limited-perspective) behaviour, sample from that dataset, and optimise the value function parameters (for a linear combination, the weights) towards the targets (a minimal buffer sketch follows this list).
      • i.e. use the entire experience to optimise towards a target. Squeeze more out of the data!
    • Deep Q-Networks - use experience replay and fixed Q-targets; algorithm skeleton in Silver's slides.
      • This is a stable algorithm!
        • Experience replay decorrelates the trajectories
        • Two networks (fixed Q-targets) - bootstrap towards a target computed with parameters frozen some time ago rather than the latest, freshest parameters; chasing a constantly moving target can cause the estimates to spiral off
  • Linear Least Squares Methods
    • Go straight to the solution by setting the expected weight change to zero and solving for the weights directly.
    • Only applies when the value function is linear in the features
    • Can improve the convergence situation when learning policies (control)
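
A minimal experience replay buffer sketch: store (s, a, r, s', done) transitions and sample random minibatches to decorrelate trajectories. The capacity and batch size are arbitrary illustrative values.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)       # oldest transitions get evicted

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniform random sampling breaks the correlation between consecutive steps.
            return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    buf = ReplayBuffer()
    buf.add("s0", "right", 1.0, "s1", False)
    print(buf.sample(4))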

Policy Gradient Approximations

Definitions

  • State Aliasing - when an environment is partially observed, or your features are reduced, so that two or more different states look identical and it is impossible to make a deterministic decision that reliably reaches the goal
  • Stochastic Policy / State Aliasing - a stochastic policy can get you out of trouble by choosing randomly among the aliased options when there is no better idea of what to do, or when motivations conflict

Background

  • Why Policy Based Reinforcement Learning? - for some problems, it is better or more convenient to learn the policy by wandering around and improving a policy based on rewards without ever actually learning the value functions.
    • Sometimes the policy is much less complex than the value function
      • e.g. Atari games - if the ball is here then it's fairly obvious you should move to the left, but much harder to realise that the value of moving left in that situation is worth 1714 points.
    • Better convergence properties - incremental changes in the policy result in incremental changes in actions, while incremental changes in value can result in large changes to the policy, and subsequently to the actions and the next value estimate.
    • Computing greedy / epsilon-greedy policy from a value function with a large action space (e.g. robot torques) can be super expensive
    • Can learn stochastic policies - as opposed to greedy / epsilon-greedy policies derived from value functions which are deterministic in nature
      • Are there alternatives to greedy / epsilon greedy that are still optimal but not deterministic?
    • Disadvantage: learning can be slower (tradeoff for being smoother) (value function updates make large pushes to an optimal policy)
    • Disadvantage: for some problems complexity in evaluating a policy is worse than updating a value function or deriving a policy from a value function
    • Example: rock paper scissors doesn't have an optimal deterministic strategy - value based learning would fail
    • Example: partially observed environments or incomplete features can result in a suboptimal deterministic strategy - stochastic policies would be better
  • Parameterised Policy - approximate the optimal policy by parameterising it: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
    • Set a differentiable objective function based on this parameter → gradient to update from!
  • 'One' Distributions (indicator / one-hot features) - a useful way of setting up linear on/off features
    • e.g. $x_i(s, a) = \mathbf{1}(\text{condition } i \text{ holds in state } s \text{ with action } a)$
  • Objective Functions - how to evaluate a policy $\pi_\theta$
    • $J_1(\theta) = V^{\pi_\theta}(s_1)$ - use the value of the start state (expected return over an episode) (episodic environments only!)
    • $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)$ - the average value, weighted by the stationary distribution $d^{\pi_\theta}$ of the Markov chain
    • $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a) \, \mathcal{R}^a_s$ - the average reward per time step
    • Maximise the objective function with respect to theta; lots of methods available
    • The stationary distribution of the Markov chain is the long-run fraction of time spent in each state when following $\pi_\theta$
    • How do you compute $V^{\pi_\theta}(s)$ in a continuing environment?
    • Why does he call it u?
  • Likelihood Ratios - exploit the identity $\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \, \nabla_\theta \log \pi_\theta(s, a)$
  • Score Function - is $\nabla_\theta \log \pi_\theta(s, a)$. The same quantity gets used in maximum likelihood learning.
  • Softmax Policy - action probabilities proportional to an exponentiated preference value; see also Wikipedia on softmax functions for probabilities.
  • Linear Softmax Policy - the preference is a linear combination of features weighted by $\theta$, i.e. $\pi_\theta(s, a) \propto e^{\phi(s, a)^\top \theta}$
    • The score function is $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$
      • Essentially: score is 'feature - average feature'
    • The latter term comes from differentiating the log of the softmax normalisation (the sum over actions in the denominator)
  • Gaussian Policy - used in continuous action spaces; the mean is a linear combination of state features, $\mu(s) = \phi(s)^\top \theta$, and the policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^2)$
    • The score function is $\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s)) \, \phi(s)}{\sigma^2}$
      • Essentially: action - mean action (how much more than usual am I doing something)
    • Here $a$ is the continuous action actually taken (e.g. a torque value sampled from the Gaussian)
  • Policy Gradient Theorem - for any differentiable policy $\pi_\theta(s, a)$ and any of the policy objective functions above, the policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)\big]$ (note Q is often just sampled from episodes - what if you don't have episodes? a REINFORCE-style sketch appears after the tip below)

  • Actor-Critic Learning - uses a critic to estimate the action value function, i.e. $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$
    • Critic - updates the action value function parameters $w$ using the earlier techniques (e.g. TD(0))
      • When using this we are introducing bias!
    • Actor - updates the policy parameters $\theta$ in the direction suggested by the critic, $\Delta\theta = \alpha \, \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)$

TIP No greedy or epsilon-greedy improvement anymore; we are improving a policy directly by gradient ascent. Note that with greedy or epsilon-greedy improvement we aren't directly learning the policy - it's an implicit result of learning the action-value function.
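
A REINFORCE-style sketch of one policy gradient step for a linear softmax policy, using the score function 'feature - average feature' and a sampled return G in place of $Q^{\pi_\theta}(s, a)$ (an actor-critic would substitute the critic's estimate $Q_w(s, a)$ or an advantage). The feature function and the tiny action set are hypothetical, for illustration only.

    import numpy as np

    ACTIONS = [0, 1]

    def phi(s, a):
        # Hypothetical state-action features.
        return np.array([float(s), float(a), float(s) * float(a)])

    def softmax_probs(theta, s):
        prefs = np.array([phi(s, a) @ theta for a in ACTIONS])
        prefs -= prefs.max()                           # numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    def reinforce_update(theta, s, a, G, alpha=0.01):
        probs = softmax_probs(theta, s)
        expected_phi = sum(p * phi(s, b) for p, b in zip(probs, ACTIONS))
        score = phi(s, a) - expected_phi               # grad log pi for a linear softmax policy
        return theta + alpha * score * G               # G: sampled return; a critic's Q_w(s,a) also works

    theta = np.zeros(3)
    theta = reinforce_update(theta, s=1.0, a=1, G=2.0)
    print(theta)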

  • Variants / Tricks
    • Baseline - subtract a state-dependent baseline (e.g. the state value function) from Q to reduce variance in the updates (e.g. variance caused by a 1,000,000 reward vs a 1,000 reward in a different sample) without changing the expected gradient
    • Advantage Functions - the critic estimates both the state and action value functions and uses the advantage $A(s, a) = Q(s, a) - V(s)$
    • Critics at different time scales - estimating V for many lambda...
    • Alternative Policy Gradient Directions
  • Pain Point - with policy gradient methods, the noise (from sampling) hurts you more and more as your estimates improve (i.e. narrow down).
  • Summary:

