Goal: train a smaller student model that matches the teacher's performance
Model:
Dataset:
Loss: KL divergence between the teacher's and student's softened output distributions (softmax over the logits); see the sketch below
Steps:
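A minimal sketch of the loss above, assuming PyTorch; the function name, the temperature of 2.0, and the T^2 gradient-scaling factor are illustrative choices, not fixed by these notes:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    # Both sets of logits are softened with the same temperature T.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```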
Idea: augment the ground-truth labels with a distribution of “soft probabilities” from the teacher
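A hedged sketch of that augmentation, assuming PyTorch and a classification-style setup; the alpha weight and the Hinton-style weighted-sum formulation (hard cross-entropy plus the soft KD term from the sketch above) are assumptions for illustration:

```python
import torch.nn.functional as F

def augmented_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Hard-label cross-entropy augmented with the teacher's soft probabilities."""
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft targets: the teacher's temperature-softened distribution,
    # matched by the student via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    soft_targets, reduction="batchmean") * temperature ** 2
    # alpha balances ground-truth supervision against the teacher's signal.
    return alpha * hard + (1.0 - alpha) * soft
```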
KD mechanism
Example: DistilBERT
Training loss is a weighted sum of three terms, L = α·L_mlm + β·L_KD + γ·L_cos, where
L_mlm is the teacher's (T) original loss function (masked language modeling), applied to the student
L_KD is the KL divergence between the student (S) and teacher (T) output distributions
L_cos is the cosine distance between S and T hidden states
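A minimal sketch of this combined objective under the three terms above, assuming PyTorch; the weights, temperature, tensor shapes, and the -100 ignore index for unmasked positions are illustrative assumptions, not values from the DistilBERT paper:

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden, mlm_labels,
                          temperature=2.0, w_mlm=1.0, w_kd=1.0, w_cos=1.0):
    """Weighted sum of L_mlm, L_KD and L_cos as listed above (weights are assumed)."""
    vocab_size = student_logits.size(-1)
    hidden_size = student_hidden.size(-1)
    # L_mlm: masked language modeling cross-entropy on the student's predictions;
    # non-masked positions are assumed to carry the ignore index -100.
    l_mlm = F.cross_entropy(student_logits.reshape(-1, vocab_size),
                            mlm_labels.reshape(-1), ignore_index=-100)
    # L_KD: KL divergence between temperature-softened student and teacher distributions.
    l_kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    # L_cos: cosine distance between student and teacher hidden states,
    # aligning the directions of the two representations.
    s = student_hidden.reshape(-1, hidden_size)
    t = teacher_hidden.reshape(-1, hidden_size)
    target = torch.ones(s.size(0), device=s.device)  # 1 => pull each pair together
    l_cos = F.cosine_embedding_loss(s, t, target)
    return w_mlm * l_mlm + w_kd * l_kd + w_cos * l_cos
```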
In practice, distillation is most effective when:
teacher & student share the same model type (architecture)