Q-Learning (the "Q" stands for the quality of an action) is an off-policy, value-based method that uses a TD (temporal difference) approach to train its action-value function:
Off-policy: uses a different policy for acting (e.g., epsilon-greedy) than for updating (e.g., greedy)
Value-based method: the optimal policy is obtained indirectly, by training a value or action-value function and acting greedily with respect to it
TD approach: updates the action-value function at each step instead of waiting until the end of the episode (as Monte Carlo does).
Internally, the Q-function is represented by a Q-table: a matrix of States ✕ Actions where each cell stores the value of one state-action pair.
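A minimal sketch of such a Q-table, assuming a small discrete environment with hypothetical sizes n_states and n_actions:

```python
import numpy as np

# Hypothetical sizes for a small discrete environment
n_states, n_actions = 16, 4

# One row per state, one column per action; commonly initialized to zeros
q_table = np.zeros((n_states, n_actions))

# Each cell q_table[s, a] is the current estimate of the state-action value
print(q_table.shape)    # (16, 4)
print(q_table[3, 2])    # value of action 2 in state 3 -> 0.0 before training
```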
Q-learning algorithm
Step 1: Initialize the Q-table (commonly to zeros) and the initial state
Step 2: Sample an action using the epsilon-greedy strategy
Exploration-exploitation tradeoff: handled with epsilon decay (start with a high epsilon to explore, then reduce it over time to exploit more).
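A sketch of epsilon-greedy sampling with a simple exponential decay schedule; the decay formula and constants are illustrative assumptions, not fixed by the algorithm:

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon, rng=None):
    """With probability epsilon explore (random action),
    otherwise exploit (greedy action from the Q-table)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # explore: random action index
    return int(np.argmax(q_table[state]))            # exploit: best known action

def decayed_epsilon(episode, eps_min=0.05, eps_max=1.0, decay_rate=0.005):
    """Exponential decay: lots of exploration early, more exploitation later."""
    return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * episode)
```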
Step 3: Perform action At, observe reward Rt+1 and next state St+1
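A sketch of this interaction, assuming a Gymnasium-style environment (FrozenLake is just an illustrative choice) and reusing the q_table and epsilon_greedy sketches above:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, info = env.reset()

action = epsilon_greedy(q_table, state, epsilon=1.0)
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated   # episode ends on either condition
```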
Step 4: Update Q(St, At) using the TD target:
Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
Off-policy (Q-learning): we act with the epsilon-greedy policy, but we update using the 100% greedy policy [max_a Q(St+1, a)], i.e., the best possible next action (greedy to maximize Q).
On-policy (e.g., SARSA): we act with the epsilon-greedy policy and also update using the epsilon-greedy policy [Q(St+1, At+1)], where At+1 is the next action actually chosen epsilon-greedily.
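A sketch of the update itself, contrasting the off-policy (Q-learning) target with an on-policy (SARSA-style) target; alpha and gamma are assumed hyperparameters, and terminal-state handling is omitted for brevity:

```python
import numpy as np

alpha, gamma = 0.1, 0.99   # assumed learning rate and discount factor

def q_learning_update(q_table, s, a, r, s_next):
    """Off-policy: the TD target uses the greedy action max_a Q(s_next, a),
    regardless of which action the behavior policy takes next."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

def sarsa_update(q_table, s, a, r, s_next, a_next):
    """On-policy: the TD target uses Q(s_next, a_next), where a_next was
    itself chosen by the epsilon-greedy behavior policy."""
    td_target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```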