Environment - The world in which the agent lives and with which it interacts.
Agent - The character interacting with the world.
State - A complete description of the environment/world. No information about the world is hidden from the state.
Observation - A partial or complete view of the state, given as input to the agent.
Fully Observed Environment - The observation is the complete state.
Partially Observed Environment - The observation is only a partial view of the state.
Action Spaces - Set of all valid actions in an environment
Discrete Action Spaces - Finite Number of Moves
Continuous Action Spaces - Real Valued Vectors
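A minimal sketch of the two kinds, assuming the Gymnasium library (sizes and bounds here are illustrative):

    from gymnasium import spaces

    discrete = spaces.Discrete(4)                             # e.g. up/down/left/right
    continuous = spaces.Box(low=-1.0, high=1.0, shape=(6,))   # e.g. six joint torques

    discrete.sample()      # an integer in {0, 1, 2, 3}
    continuous.sample()    # a real-valued vector of length 6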
Policies - The rule the agent uses to decide which action to take.
Stochastic Policies - The action is sampled from a probability distribution.
Two common kinds
Categorical Policies - Used in Discrete Action Spaces
Diagonal Gaussian Policies - Used in Continuous Action Spaces
Need to be able to sample actions from policies and compute log likelihoods of particular actions
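A minimal sketch of both operations, assuming PyTorch's distributions module (the logits, mean, and log-std values below are placeholders that a policy network would normally output):

    import torch
    from torch.distributions import Categorical, Normal

    # Categorical policy (discrete action spaces): one logit per action.
    logits = torch.tensor([0.5, 1.2, -0.3])
    cat_pi = Categorical(logits=logits)
    a = cat_pi.sample()            # sample an action
    logp_a = cat_pi.log_prob(a)    # log-likelihood of that action

    # Diagonal Gaussian policy (continuous action spaces): mean vector plus
    # per-dimension log-stds; the diagonal covariance makes the dimensions
    # independent, so log-probs sum over action dimensions.
    mu = torch.tensor([0.1, -0.4])
    log_std = torch.tensor([-0.5, -0.5])
    gauss_pi = Normal(mu, log_std.exp())
    a = gauss_pi.sample()
    logp_a = gauss_pi.log_prob(a).sum()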
Trajectories/Episodes/Rollouts - Sequence of States and Actions in the world
The first state is sampled from the start-state distribution
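A minimal rollout loop, assuming the Gymnasium API (the random action is a stand-in for a real policy):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset()            # first state ~ start-state distribution
    trajectory, done = [], False
    while not done:
        action = env.action_space.sample()   # stand-in for policy(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        trajectory.append((obs, action, reward))
        obs, done = next_obs, terminated or truncated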
Reward - The scalar value returned by the environment; it depends on the current state, the action taken, and the next state.
Return - Cumulative Reward
Finite-Horizon Undiscounted Return - Sum of rewards obtained in a fixed window of steps.
Infinite-Horizon Discounted Return - Sum of all rewards ever obtained, discounted by how far in the future they are received.
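A minimal sketch computing both returns from a list of rewards (the reward values and gamma are illustrative):

    rewards = [1.0, 0.0, 2.0, 1.0]
    gamma = 0.99

    undiscounted = sum(rewards)                                      # r_0 + r_1 + ... + r_{T-1}
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))    # sum_t gamma^t * r_t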
Value Functions - Expected return if you start in a state or a state-action pair and then act according to a particular policy.
On-Policy Value Function - Expected return if you start in a state and forever after act according to the policy.
On-Policy Action-Value Function - Expected return if you start in a state, take an arbitrary action (which may not have come from the policy), and then forever after act according to the policy. Also known as the Q-function.
Optimal Value Function - Expected return if you start in a state and always act according to the optimal policy.
Optimal Action-Value Function - Expected return if you start in a state, take an arbitrary action, and then forever after act according to the optimal policy.
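In symbols (standard definitions; tau denotes a trajectory and R(tau) its return):

    V^{\pi}(s)   = \mathbb{E}_{\tau \sim \pi}[\, R(\tau) \mid s_0 = s \,]
    Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}[\, R(\tau) \mid s_0 = s, a_0 = a \,]
    V^{*}(s)     = \max_{\pi} V^{\pi}(s)
    Q^{*}(s,a)   = \max_{\pi} Q^{\pi}(s,a)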
Advantage Functions - How much better a particular action is compared to the average action under the policy.
Equivalently, how much better it is to select a specific action in a state than to select an action at random according to the policy, and then follow that same policy forever after.
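In symbols, the advantage of action a in state s under policy pi:

    A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)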
Model-Based Algorithms - The agent has access to (or learns) a model of the environment.
Model-Free Algorithms - The agent doesn't have access to a model of the environment.
Policy Optimization - Optimize the parameters representing the policy directly; this is usually done on-policy, meaning the data for each update is collected while acting according to the most recent version of the policy.
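A minimal sketch of the core idea, assuming PyTorch (the log-probabilities and returns below are placeholder tensors standing in for a batch of on-policy data):

    import torch

    logp_a = torch.randn(32, requires_grad=True)   # log pi_theta(a_t | s_t) for the batch
    returns = torch.randn(32)                      # returns (or advantages) for the same timesteps

    loss = -(logp_a * returns).mean()              # minimizing this ascends expected return
    loss.backward()                                # gradients flow into the policy parameters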
Q-Learning Optimization - Learn an approximation of the optimal action-value function. Usually performed off-policy, meaning each update can use data collected at any point during training.
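A minimal sketch of the tabular Q-learning update toward the optimal action-value function (table sizes and hyperparameters are illustrative):

    import numpy as np

    n_states, n_actions = 5, 3           # illustrative sizes
    Q = np.zeros((n_states, n_actions))  # table approximating Q*(s, a)
    alpha, gamma = 0.1, 0.99             # learning rate and discount factor

    def q_update(s, a, r, s_next, done):
        # Bootstrapped target takes a max over next actions, so the transition
        # (s, a, r, s_next) can come from any behavior policy (off-policy).
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])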