Q-Learning (the "Q" stands for the quality of an action) is an off-policy, value-based method that uses a TD (temporal difference) approach to train its action-value function:
Off-policy: uses a different policy for acting (e.g., epsilon-greedy) than for updating (e.g., greedy)
Value-based method: the optimal policy is obtained indirectly, by training a value or action-value function and acting greedily with respect to it
TD approach: updates the action-value function at each step instead of waiting until the end of the episode (as Monte Carlo does).
Internally, the Q-function is represented by a Q-table: a matrix of States ✕ Actions where each cell stores the value of one state-action pair.
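A minimal sketch of such a Q-table, assuming a small discrete environment with hypothetical sizes n_states and n_actions:

```python
import numpy as np

# Hypothetical sizes for a small discrete environment
n_states, n_actions = 16, 4

# One row per state, one column per action; commonly initialized to zeros
q_table = np.zeros((n_states, n_actions))

# Each cell q_table[s, a] is the current estimate of the state-action value
print(q_table.shape)    # (16, 4)
print(q_table[3, 2])    # value of action 2 in state 3 -> 0.0 before training
```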
Q-learning algorithm
Step 1: Initialize the Q-table (commonly to zeros) and the initial state
Step 2: Sample an action using the epsilon-greedy strategy
Exploration-exploitation tradeoff: handled with epsilon decay (start with a high epsilon to explore, then reduce it over time to exploit more).
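A sketch of epsilon-greedy sampling with a simple exponential decay schedule; the decay formula and constants are illustrative assumptions, not fixed by the algorithm:

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon, rng=None):
    """With probability epsilon explore (random action),
    otherwise exploit (greedy action from the Q-table)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # explore: random action index
    return int(np.argmax(q_table[state]))            # exploit: best known action

def decayed_epsilon(episode, eps_min=0.05, eps_max=1.0, decay_rate=0.005):
    """Exponential decay: lots of exploration early, more exploitation later."""
    return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * episode)
```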
Step 3: Perform action At, observe reward Rt+1 and next state St+1
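A sketch of this interaction, assuming a Gymnasium-style environment (FrozenLake is just an illustrative choice) and reusing the q_table and epsilon_greedy sketches above:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, info = env.reset()

action = epsilon_greedy(q_table, state, epsilon=1.0)
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated   # episode ends on either condition
```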
Step 4: Update Q(St, At) using the TD target:
Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
Off-policy (Q-learning): we act with the epsilon-greedy policy, but we update using the 100% greedy policy [max_a Q(St+1, a)], i.e., the best possible next action (greedy to maximize Q).
On-policy (e.g., SARSA): we act with the epsilon-greedy policy and also update using the epsilon-greedy policy [Q(St+1, At+1)], where At+1 is the next action actually chosen epsilon-greedily.
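A sketch of the update itself, contrasting the off-policy (Q-learning) target with an on-policy (SARSA-style) target; alpha and gamma are assumed hyperparameters, and terminal-state handling is omitted for brevity:

```python
import numpy as np

alpha, gamma = 0.1, 0.99   # assumed learning rate and discount factor

def q_learning_update(q_table, s, a, r, s_next):
    """Off-policy: the TD target uses the greedy action max_a Q(s_next, a),
    regardless of which action the behavior policy takes next."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

def sarsa_update(q_table, s, a, r, s_next, a_next):
    """On-policy: the TD target uses Q(s_next, a_next), where a_next was
    itself chosen by the epsilon-greedy behavior policy."""
    td_target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```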