Concise list of terms and concepts in reinforcement learning.
...
General Taxonomy
- Agent Types - policy-based, value-based, actor-critic (policy and value), model-free and model-based
- History $H_t$ - the sequence of observations, rewards, and actions ($O_1, R_1, A_1, \dots, O_t, R_t, A_t$)
- Markov - a state is Markov if the future is fully determined by the current state alone (i.e. all history up to the present can be discarded)
- Model - predicts what the environment will do next given the current state and an action; usually two parts:
- dynamics: predictor of the next agent state given its interaction with the environment
- rewards: predictor of the next reward received from the environment
- Policy $\pi(s)$ / $\pi(a|s)$ - defines how an agent behaves; a map from state to action that can be deterministic or probabilistic (see the sketch below)
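A policy can be as small as a lookup table. A minimal sketch in Python, where the states, actions, and probabilities are all hypothetical:

```python
import random

# Deterministic policy: a direct map from state to action, pi(s) = a.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: a map from state to a distribution over actions, pi(a|s).
stochastic_pi = {"s0": {"left": 0.7, "right": 0.3},
                 "s1": {"left": 0.5, "right": 0.5}}

def act(policy, state):
    """Pick an action: directly for a deterministic policy, by sampling otherwise."""
    choice = policy[state]
    if isinstance(choice, dict):
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]
    return choice

print(act(deterministic_pi, "s0"))  # always "left"
print(act(stochastic_pi, "s1"))     # "left" or "right" with equal probability
```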
- State $S_t$, $S^a_t$, $S^e_t$ - state, agent state, environment state
- Value Function - provides an estimate of how good each state and/or action is
- e.g. a predictor of future reward given a policy and a state: $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
- this becomes harder to calculate when the policy is probabilistic; the expectation can be estimated by sampling, as in the sketch below
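Since $v_{\pi}(s)$ is an expectation over trajectories, it can be approximated by running many episodes from $s$ and averaging the discounted returns. A minimal Monte Carlo sketch; the transition table (an MRP, i.e. an MDP already flattened with its policy) and rewards are made up for illustration:

```python
import random

# Hypothetical MRP: from each state, ((next_state, reward), probability) outcomes.
# An MDP with a fixed policy reduces to exactly this kind of table.
transitions = {
    "A": [(("B", 1.0), 0.8), (("A", 0.0), 0.2)],
    "B": [(("A", 0.0), 0.5), (("B", 2.0), 0.5)],
}
gamma = 0.9

def sample_return(state, steps=200):
    """Sample one trajectory and accumulate the discounted return.

    This MRP never terminates, so the trajectory is truncated at `steps`;
    with gamma = 0.9 the truncated tail is negligible (0.9^200 ~ 1e-9).
    """
    g, discount = 0.0, 1.0
    for _ in range(steps):
        outcomes, probs = zip(*transitions[state])
        state, reward = random.choices(outcomes, weights=probs)[0]
        g += discount * reward
        discount *= gamma
    return g

# v(s) is approximated by the mean return over many sampled trajectories.
v_A = sum(sample_return("A") for _ in range(2000)) / 2000
print(f"v(A) ~ {v_A:.2f}")
```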
Concepts
- Exploration vs Exploitation - explore to learn the environment, then exploit what has been learned to maximise reward (the two need careful balancing)
- Prediction vs Control
- Prediction - What is the value function for the given policy?
- Control - What is the optimal value function, or policy?
- Reinforcement Learning vs Planning
- Reinforcement Learning (interaction) - environment is unknown (no model), interacts with the environment, improves its policy
- Planning (thinking) - environment is known (model), no interaction with the environment, computes with the model, improves its policy
- Stationary Processes - time independence, e.g. MDP states that represent the same logic regardless of the time at which they are entered.
- Stochastic Policies - are sometimes necessary, e.g. rock-paper-scissors requires a random policy, lest a deterministic one be exploited by the other player
Markov Decision Processes
- Bellman Equation (MRP) - value satisfies a recursive equation per state, with a direct closed-form solution (sketched in code below):

$$
\begin{align}
v &= \mathcal{R} + \gamma \mathcal{P} v \\
  &= (I - \gamma \mathcal{P})^{-1} \mathcal{R}
\end{align}
$$
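The closed form is just a linear solve. A minimal numpy sketch with a hypothetical two-state MRP:

```python
import numpy as np

# Hypothetical 2-state MRP: transition matrix P and expected immediate rewards R.
P = np.array([[0.2, 0.8],
              [0.5, 0.5]])
R = np.array([0.8, 1.0])
gamma = 0.9

# v = (I - gamma * P)^-1 R, computed without forming the inverse explicitly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)
```

The solve is cubic in the number of states, so it is only practical for small MRPs; larger ones fall back to iterative methods.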
- Bellman Expectation Equations (MDP) - the Bellman equations given actions and hence policies, now considering both state-value and action-value functions (an iterative evaluation sketch follows the equations):

$$
\begin{align}
v_{\pi}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_{\pi}(s') \right) \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) q_{\pi}(s,a) \\
v_{\pi} &= \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_{\pi} \\
&= (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi} \\
q_{\pi}(s,a) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \\
&= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a'|s') q_{\pi}(s',a') \\
&= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_{\pi}(s')
\end{align}
$$
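The first equation can also be applied as a repeated backup until $v_{\pi}$ stops changing (iterative policy evaluation). A minimal sketch, assuming the MDP is stored as tensors P[a, s, s'] = $\mathcal{P}^a_{ss'}$ and R[s, a] = $\mathcal{R}^a_s$ (that layout, and the random MDP itself, are assumptions for illustration):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical random MDP: P[a, s, s'] rows sum to 1; R[s, a] in [0, 1).
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy

# Sweep the Bellman expectation backup until the value function converges.
v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)   # q_pi(s,a) backup
    v_new = np.einsum("sa,sa->s", pi, q)           # v_pi(s) = sum_a pi(a|s) q_pi(s,a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)
```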
...
- Discount $\gamma$ - how much reward diminishes into the future when calculating the expected return
- Episode - a specific sequence of states sampled from a Markov Process (Chain)
- Flatten - you can always flatten an MDP with a specific policy into an MRP (see the sketch below).
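A sketch of that flattening, assuming the same hypothetical P[a, s, s'] / R[s, a] layout as above: the policy is averaged into a plain transition matrix and reward vector.

```python
import numpy as np

def flatten_mdp(P, R, pi):
    """Collapse an MDP plus a fixed policy into an MRP:
    P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
    R_pi[s]     = sum_a pi(a|s) * R[s, a]
    """
    P_pi = np.einsum("sa,ast->st", pi, P)
    R_pi = np.einsum("sa,sa->s", pi, R)
    return P_pi, R_pi
```

The resulting pair can be fed directly into the closed-form MRP solve shown earlier.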
- Markov Process $\langle \mathcal{S}, \mathcal{P} \rangle$ - a tuple defining a memoryless random process (also known as a Markov Chain)
- Markov Reward Process $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ - adds a reward function and discount, from which values can be computed
- Markov Decision Process $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ - includes actions, which condition the probability transition matrix and reward function
- Partially Observable Markov Decision Process $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle$ - has hidden states and an observation function $\mathcal{Z}^a_{s'o} = \mathbb{P}[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
- Optimality
- Bellman Equations - similar to the policy-based Bellman equations, but with a max over actions in place of the expectation under the policy (see the value iteration sketch below)
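Replacing the expectation with a max turns policy evaluation into value iteration, the standard dynamic-programming route to the optimal value function. A minimal sketch under the same hypothetical P[a, s, s'] / R[s, a] layout as the earlier examples:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Iterate v(s) <- max_a [ R(s,a) + gamma * sum_s' P(a,s,s') v(s') ]."""
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)          # max over actions, not an expectation
        if np.max(np.abs(v_new - v)) < tol:
            # Once v has converged, the greedy policy read off q is optimal.
            return v_new, q.argmax(axis=1)
        v = v_new
```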