Definitions & Concepts
Concise list of terms and concepts in reinforcement learning.
Other Useful Definitions
- Active Learning - a form of supervised learning in which the machine may query the supervisor; reinforcement learning is a kind of active learning (the agent's actions alter what it can sense and learn)
- Principle of Optimality - applies to problems where optimally solving the subproblems yields an optimal solution to the overall problem (see the sketch below)
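As a concrete illustration of the principle of optimality, here is a minimal sketch that computes shortest-path costs on a small made-up graph: the optimal cost-to-go from a node is assembled from the already-optimal costs of its successors. The graph, costs and node names are hypothetical, not from these notes.

```python
from functools import lru_cache

# Hypothetical directed graph with edge costs; "D" is the goal node.
COSTS = {("A", "B"): 1, ("A", "C"): 4, ("B", "C"): 1, ("B", "D"): 5, ("C", "D"): 1}
SUCCESSORS = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": []}

@lru_cache(maxsize=None)
def shortest_cost(node):
    """Optimal cost-to-go from `node`, built from the optimal costs of its successors."""
    if node == "D":
        return 0
    return min(COSTS[(node, nxt)] + shortest_cost(nxt) for nxt in SUCCESSORS[node])

print(shortest_cost("A"))  # 3, via A -> B -> C -> D
```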
General Taxonomy
- Agent Types - policy-based, value-based, actor-critic (both policy and value), model-free and model-based
- History - the sequence of observations, actions and rewards up to time t, H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- Markov - a state is considered Markov if the future depends only on the present state, i.e. P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t] (all history up to the present can be thrown away)
- Model - predicts what the environment will do next given the current state and an action, usually in two parts:
  - dynamics: a predictor of the next agent state given its interaction with the environment, e.g. P(s' | s, a)
  - rewards: a predictor of the next reward received from its environment, e.g. R(s, a)
- Policy - defines how an agent will behave; it is a map from state to action and can be deterministic or stochastic (policy, value function and model are sketched as tables after this list)
- State - comes in several flavours: environment state (the environment's private representation), agent state (the agent's internal representation), and information (Markov) state
- Value Function - provides an estimate of how good each state and/or action is
  - e.g. a predictor of future reward given a policy and a state, v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
  - this gets harder to compute when policies are stochastic
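A minimal sketch of how the policy, value function and model above might be represented in a small tabular setting; the two-state example and all of its numbers are invented for illustration.

```python
# Hypothetical two-state, two-action example of the agent components above.
states = ["s0", "s1"]
actions = ["left", "right"]

# Policy: a map from state to a distribution over actions (stochastic here).
policy = {"s0": {"left": 0.5, "right": 0.5},
          "s1": {"left": 0.1, "right": 0.9}}

# Value function: an estimate of how good each state is under the policy.
value = {"s0": 0.0, "s1": 1.0}

# Model, part 1 - dynamics: P(next_state | state, action).
dynamics = {("s0", "left"):  {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
            ("s1", "left"):  {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}

# Model, part 2 - rewards: expected immediate reward for each (state, action).
rewards = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
           ("s1", "left"): 0.0, ("s1", "right"): 1.0}
```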
Concepts
- Exploration vs Exploitation - explore to learn about the environment, and exploit what has been learned to maximise reward; the two need careful balancing (see the ε-greedy sketch after this list)
- Prediction vs Control
  - Prediction - What is the value function for the given policy?
  - Control - What is the optimal value function, or policy?
- Reinforcement Learning vs Planning
- Reinforcement Learning (interaction) - environment is unknown (no model), interacts with the environment, improves its policy
- Planning (thinking) - environment is known (model), no interaction with the environment, computes with the model, improves its policy
- Stationary Processes - time-independence, e.g. MDP states that represent the same situation regardless of the time step at which they are entered.
- Stochastic Policies - are sometimes necessary, e.g. rock-paper-scissors requires a random policy, otherwise the opponent can exploit it
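One common way to balance exploration and exploitation is ε-greedy action selection, sketched below over hypothetical action-value estimates; the values and the choice of ε are made up.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Hypothetical action-value estimates for a single state.
q = {"left": 0.2, "right": 0.7, "stay": 0.5}
print(epsilon_greedy(q))  # usually "right", occasionally a random action
```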
Markov Decision Processes
- Bellman Equation (MRP) - calculates value per state by decomposing it into the immediate reward plus the discounted value of the successor states, v(s) = R_s + γ Σ_{s'} P_{ss'} v(s'); it can be solved directly (it is linear) or iteratively
- Bellman Expectation Equations (MDP) - the Bellman equations given actions and hence policies (now considering both state-value and action-value functions): v_π(s) = Σ_a π(a|s) q_π(s,a) and q_π(s,a) = R_s^a + γ Σ_{s'} P_{ss'}^a v_π(s')
- Bellman Optimality Equation (MDP) - since an optimal policy always exists, then given its deterministic form, the Bellman expectation equations reduce to v_*(s) = max_a (R_s^a + γ Σ_{s'} P_{ss'}^a v_*(s')) and q_*(s,a) = R_s^a + γ Σ_{s'} P_{ss'}^a max_{a'} q_*(s',a')
- Discount - the factor γ ∈ [0, 1] by which reward diminishes into the future when calculating the expected return, G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
- Episode - a specific sequence of states from a Markov Process (Chain)
- Flatten - you can always flatten an MDP with a specific policy into an MRP.
- Markov Process - a tuple ⟨S, P⟩ (a state set and a state transition probability matrix) defining a memoryless random process (also known as a Markov Chain)
- Markov Reward Process - adds a reward function R and a discount factor γ, giving ⟨S, P, R, γ⟩, from which values can be computed
- Markov Decision Process - a tuple ⟨S, A, P, R, γ⟩; includes actions, which condition the transition probability matrix and reward function
- Partially Observable Markov Decision Process - has hidden (environment) states and an observation function
- Optimality
  - Bellman Equations - similar to the policy-based Bellman equations, but with a maximisation over actions in place of the expectation under the policy (both backups are sketched below)
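A minimal sketch of both backups on a tiny hypothetical MDP: iterative policy evaluation applies the Bellman expectation equation (the prediction problem), while value iteration applies the Bellman optimality equation (the control problem). The states, transition probabilities and rewards are invented purely for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only to illustrate the backups.
# P[a][s, s'] = transition probability, R[a][s] = expected immediate reward.
P = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
     "move": np.array([[0.2, 0.8], [0.9, 0.1]])}
R = {"stay": np.array([0.0, 1.0]), "move": np.array([0.5, 0.0])}
gamma, n_states = 0.9, 2

def policy_evaluation(policy, sweeps=500):
    """Bellman expectation backup: v(s) <- Σ_a π(a|s) [R(s,a) + γ Σ_s' P(s'|s,a) v(s')]."""
    v = np.zeros(n_states)
    for _ in range(sweeps):
        v = sum(policy[a] * (R[a] + gamma * P[a] @ v) for a in P)
    return v

def value_iteration(sweeps=500):
    """Bellman optimality backup: v(s) <- max_a [R(s,a) + γ Σ_s' P(s'|s,a) v(s')]."""
    v = np.zeros(n_states)
    for _ in range(sweeps):
        v = np.max(np.stack([R[a] + gamma * P[a] @ v for a in P]), axis=0)
    return v

# Prediction: evaluate a uniform random policy, π(a|s) = 0.5 everywhere.
uniform = {"stay": np.full(n_states, 0.5), "move": np.full(n_states, 0.5)}
print("v_pi (prediction):", policy_evaluation(uniform))
print("v_*  (control):   ", value_iteration())
```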