Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy?
Md Arafat Sultan

TL;DR
This paper critically examines the claimed equivalence between knowledge distillation (KD) and label smoothing (LS), finding that the two often drive model confidence in opposite directions and reaffirming KD as a knowledge transfer method rather than merely a form of regularization.
Contribution
The study provides empirical evidence that KD and LS differ fundamentally in how they influence model confidence, challenging the view that KD is merely a form of regularization.
Findings
In most settings, KD and LS drive model confidence in opposite directions
In KD, the student inherits its confidence, not only its knowledge, from the teacher
Experiments cover four text classification tasks with models of different sizes
Abstract
Knowledge distillation (KD) was originally proposed as a method for knowledge transfer from one model to another, but some recent studies have suggested that it is in fact a form of regularization. Perhaps the strongest argument for this new perspective comes from KD's apparent similarities with label smoothing (LS). Here we re-examine this stated equivalence between the two methods by comparing the predictive confidences of the models they train. Experiments on four text classification tasks involving models of different sizes show that: (a) in most settings, KD and LS drive model confidence in completely opposite directions, and (b) in KD, the student inherits not only its knowledge but also its confidence from the teacher, reinforcing the classical knowledge transfer view.
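To make the comparison concrete, here is a minimal sketch of how the training targets of the two methods can differ. It assumes a common textbook formulation in which both methods mix the one-hot label with a second distribution: uniform for LS, the teacher's output for KD. The mixing weight `alpha`, the teacher logits, and the use of entropy as a confidence proxy are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def ls_target(label, num_classes, alpha):
    """Label smoothing: mix the one-hot label with a uniform distribution."""
    hard = one_hot(label, num_classes)
    return [(1 - alpha) * h + alpha / num_classes for h in hard]


def kd_target(label, teacher_probs, alpha):
    """KD (assumed formulation): mix the one-hot label with teacher output."""
    hard = one_hot(label, len(teacher_probs))
    return [(1 - alpha) * h + alpha * t for h, t in zip(hard, teacher_probs)]


def entropy(probs):
    """Shannon entropy in nats; lower entropy means a more confident target."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical teacher logits for a 4-class example; the teacher is confident.
teacher_probs = softmax([6.0, 1.0, 0.5, 0.0])

ls = ls_target(label=0, num_classes=4, alpha=0.5)
kd = kd_target(label=0, teacher_probs=teacher_probs, alpha=0.5)

print("LS target entropy:", entropy(ls))
print("KD target entropy:", entropy(kd))
```

Under these assumptions, LS always pulls the target toward uniform, while the KD target is only as soft as the teacher is. With a confident teacher the KD target stays sharp, and the two targets coincide only when the teacher's output is itself uniform.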
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning and Algorithms
Methods: Knowledge Distillation · Label Smoothing
