Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?
Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, Ngai-Man Cheung

TL;DR
This paper examines the conflicting findings on whether label smoothing and knowledge distillation are compatible. It identifies systematic diffusion as the key factor that explains the contradiction and derives practical recommendations from it.
Contribution
It introduces the concept of systematic diffusion to understand the incompatibility between label smoothing and knowledge distillation, supported by extensive experiments across tasks and architectures.
Findings
Systematic diffusion explains the reduced effectiveness of knowledge distillation (KD) from teachers trained with label smoothing (LS).
Using low-temperature transfer with LS-trained teachers improves student performance.
The study provides practical guidelines for combining LS and KD effectively.
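The two ingredients these findings refer to can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the standard formulations (label smoothing as a mix of the one-hot target with a uniform distribution, and KD as a KL divergence between temperature-softened teacher and student outputs). The function names, logits, and the smoothing value `alpha = 0.1` are hypothetical, and the usual `T^2` loss scaling is omitted for brevity.

```python
import math

def smooth_labels(true_class, num_classes, alpha):
    """Label smoothing: (1 - alpha) * one-hot + alpha / num_classes."""
    uniform = alpha / num_classes
    return [(1.0 - alpha) + uniform if k == true_class else uniform
            for k in range(num_classes)]

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T; a larger T gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative (made-up) logits for a 3-class problem.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]

targets = smooth_labels(true_class=0, num_classes=3, alpha=0.1)
loss_low_t = kd_loss(teacher_logits, student_logits, temperature=1.0)
loss_high_t = kd_loss(teacher_logits, student_logits, temperature=3.0)
```

In this sketch, "low-temperature transfer" corresponds to calling `kd_loss` with a small `temperature`, which keeps the teacher's softened distribution peaked rather than flattened.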
Abstract
This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019) and Shen et al. (2021b). Critically, there is no effort to understand and resolve these contradictory findings, leaving the primal question -- to smooth or not to smooth a teacher network? -- unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image…
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications
Methods: Diffusion · Knowledge Distillation · Label Smoothing
