Problem: tabular methods are not scalable.
Idea: instead of learning a Q-table entry for each state × action pair, approximate the Q-value with a parametrized Q-function (e.g. a neural network)
Q-learning
Deep Q-learning
Img 1. DQN architecture
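For concreteness, here is a minimal PyTorch sketch of the classic DQN convolutional network (layer sizes follow Mnih et al., 2015); the class name and comments are illustrative, not taken from these notes:

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: a stack of 4 preprocessed 84x84 frames in,
    one Q-value per action out (layer sizes follow Mnih et al., 2015)."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        # x: float tensor of shape (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)
```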
Img 2. Preprocessing reduces complexity
Original input: 160 × 210 × 3
Reduce size and color channels:
convert 3-channel RGB to 1-channel grayscale
resize the image to 84 × 84
New size: 84 × 84 × 1
Temporal limitation: a single frame carries no motion information (e.g. which way the ball is moving), so we
stack the last 4 frames
Final size: 84 × 84 × 4
P.S. we could also crop the image to the important game zone (see the preprocessing sketch below).
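A minimal preprocessing sketch in Python, assuming frames arrive as H × W × 3 uint8 NumPy arrays; the function and class names are hypothetical:

```python
from collections import deque

import numpy as np
from PIL import Image


def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Turn one H x W x 3 RGB frame (uint8) into an 84 x 84 grayscale array."""
    gray = Image.fromarray(frame).convert("L")   # 3 color channels -> 1 grayscale channel
    resized = gray.resize((84, 84))              # shrink to 84 x 84
    return np.asarray(resized, dtype=np.uint8)


class FrameStack:
    """Keep the last 4 preprocessed frames as a single 84 x 84 x 4 state."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        first = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):      # fill the stack with the first frame
            self.frames.append(first)
        return np.stack(self.frames, axis=-1)    # shape (84, 84, 4)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(frame))
        return np.stack(self.frames, axis=-1)    # channels-last; transpose for PyTorch CNNs
```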
In Q-learning, we update the Q-table with the following formula:
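For reference, the standard tabular update rule (with learning rate α and discount factor γ) is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$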
In Deep Q-Learning, we instead define a loss function between our Q-value prediction and the Q-target.
Then we use gradient descent to update the network's weights so that its predictions move closer to the Q-targets.
Note: we use a second network Q̂ so that we are not optimizing towards a moving target
(see Fixed Q-target below).
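Written out (a standard formulation; θ denotes the online network's weights and θ⁻ the fixed target network's weights), the per-sample loss is:

$$\mathcal{L}(\theta) = \big( R_{t+1} + \gamma \max_{a'} \hat{Q}(S_{t+1}, a'; \theta^{-}) - Q(S_t, A_t; \theta) \big)^2$$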
DQN can suffer from instability because we combine non-linear function approximation (the NN) with bootstrapping (the network is updated towards its own estimates rather than ground truth).
To help stabilize the network we apply:
Experience replay: store experience tuples in a buffer and later sample them in mini-batches (the buffer size N is a hyperparameter); see the sketch after this list
reuse and learn from the same experiences multiple times (without the cost of collecting new samples)
reduce correlation between sequential samples and avoid forgetting previous experiences (which would otherwise be overwritten in the weights by the most recent data)
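A minimal replay-buffer sketch in Python; the class and method names are hypothetical:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the correlation between consecutive frames
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```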
Fixed Q-target: prevent optimizing towards a moving target:
We want to reduce the error between the target and the prediction. Updating the weights moves the prediction closer to the current target (good), but the new weights also shift the target itself, so it moves away again and the error can grow.
We therefore use a separate "target" network to keep the target fixed. Every C steps (a hyperparameter), we copy the learned weights into the target network (see the sketch below).
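A sketch of one training step with a fixed Q-target in PyTorch, assuming mini-batches already come from the replay buffer as tensors; the tiny MLP is only a stand-in for the convolutional network above, and all names are illustrative:

```python
import copy

import torch
import torch.nn as nn

# Tiny MLP as a stand-in for the convolutional DQN above (4-dim states, 2 actions).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)                # frozen copy used only to compute targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma, C = 0.99, 1_000                           # discount factor, copy interval


def train_step(step, states, actions, rewards, next_states, dones):
    """One gradient step on a mini-batch sampled from the replay buffer.
    states: (B, 4) float, actions: (B,) long, rewards/dones: (B,) float."""
    with torch.no_grad():                        # the target does not move while we update q_net
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1 - dones)
    pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:                            # every C steps, sync the target network
        target_net.load_state_dict(q_net.state_dict())
```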
Double Deep Q-learning (read more)
Problem: Over-estimation of Q-values
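The over-estimation comes from using the same max operator to both select and evaluate the next action. The usual Double DQN remedy, sketched below with hypothetical names, lets the online network pick the action and the target network score it:

```python
import torch


def double_dqn_target(rewards, next_states, dones, q_net, target_net, gamma=0.99):
    """Double DQN target: the online net picks the next action, the target net scores it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # action selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # action evaluation
        return rewards + gamma * next_q * (1 - dones)
```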