Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy?

Md Arafat Sultan

arXiv:2301.12609 · cs.LG · October 26, 2023

TL;DR

This paper critically examines the claimed equivalence between knowledge distillation (KD) and label smoothing (LS), showing that the two often push model confidence in opposite directions and reaffirming KD as a knowledge transfer method rather than mere regularization.

Contribution

The study provides empirical evidence that KD and LS differ fundamentally in how they influence model confidence, challenging the view that KD is merely a form of regularization.

Findings

01 KD and LS often drive model confidence in opposite directions

02 In KD, the student inherits its confidence, not only its knowledge, from the teacher

03 Experiments cover four text classification tasks with models of different sizes

Abstract

Knowledge distillation (KD) was originally proposed as a method for knowledge transfer from one model to another, but some recent studies have suggested that it is in fact a form of regularization. Perhaps the strongest argument for this new perspective comes from KD's apparent similarities with label smoothing (LS). Here we re-examine this stated equivalence between the two methods by comparing the predictive confidences of the models they train. Experiments on four text classification tasks involving models of different sizes show that: (a) in most settings, KD and LS drive model confidence in completely opposite directions, and (b) in KD, the student inherits not only its knowledge but also its confidence from the teacher, reinforcing the classical knowledge transfer view.
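To make the confidence argument concrete, here is a minimal sketch of the training targets each method implies. It assumes a common textbook formulation rather than the paper's exact setup: LS mixes the one-hot label with a uniform distribution, and KD mixes it with the teacher's softmax output. The mixing weight `alpha`, the teacher logits, and all function names are hypothetical, chosen only for illustration. Because cross-entropy is minimized when the student matches its target, the confidence of the target (its largest probability) indicates where the student's confidence is pushed.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(true_class, num_classes):
    return [1.0 if k == true_class else 0.0 for k in range(num_classes)]

def label_smoothing_target(true_class, num_classes, alpha):
    """LS target: (1 - alpha) * one-hot + alpha * uniform (assumed form)."""
    hard = one_hot(true_class, num_classes)
    return [(1.0 - alpha) * h + alpha / num_classes for h in hard]

def kd_target(true_class, teacher_logits, alpha, temperature=1.0):
    """KD target: (1 - alpha) * one-hot + alpha * teacher distribution (assumed form)."""
    hard = one_hot(true_class, len(teacher_logits))
    teacher = softmax(teacher_logits, temperature)
    return [(1.0 - alpha) * h + alpha * t for h, t in zip(hard, teacher)]

def confidence(distribution):
    """Confidence measured as the largest class probability."""
    return max(distribution)

# Hypothetical 3-class example; the teacher logits are made up.
ls = label_smoothing_target(true_class=0, num_classes=3, alpha=0.5)
kd_sure = kd_target(true_class=0, teacher_logits=[8.0, 0.0, 0.0], alpha=0.5)
kd_unsure = kd_target(true_class=0, teacher_logits=[0.5, 0.0, 0.0], alpha=0.5)

print("LS target confidence:          ", round(confidence(ls), 4))
print("KD target, confident teacher:  ", round(confidence(kd_sure), 4))
print("KD target, unconfident teacher:", round(confidence(kd_unsure), 4))
```

In this sketch the LS target has the same confidence for every example, since it depends only on `alpha` and the number of classes, while the KD target's confidence rises and falls with the teacher's. That is one way to read the abstract's claim that the student inherits its confidence from the teacher, though the paper's own measurements are made on trained models, not on targets.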

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning and Algorithms

Methods: Knowledge Distillation · Label Smoothing