When Does Label Smoothing Help?
Rafael Müller, Simon Kornblith, Geoffrey Hinton

TL;DR
This paper investigates the effects of label smoothing on neural networks, showing it improves generalization and calibration but hampers knowledge distillation, with insights into how it alters learned representations.
Contribution
It provides empirical analysis of label smoothing's impact on generalization, calibration, and knowledge distillation, and visualizes how it modifies learned representations.
Findings
Label smoothing improves model calibration and generalization.
Training a teacher network with label smoothing makes knowledge distillation into a student network much less effective.
Label smoothing encourages the representations of training examples from the same class to group in tight clusters.
Abstract
The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing…
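The soft targets described in the abstract, a weighted average of the hard targets and the uniform distribution over labels, can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the smoothing weight `alpha`, and its value of 0.1 are assumptions made for the example.

```python
import math


def smooth_labels(hard_targets, alpha=0.1):
    """Mix a one-hot target with the uniform distribution over K labels.

    alpha is a hypothetical smoothing weight: 0 returns the hard targets,
    1 returns the uniform distribution.
    """
    k = len(hard_targets)
    return [(1.0 - alpha) * y + alpha / k for y in hard_targets]


def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


one_hot = [0.0, 1.0, 0.0, 0.0]           # hard target: class 1 of 4
soft = smooth_labels(one_hot, alpha=0.1)  # roughly [0.025, 0.925, 0.025, 0.025]

# An over-confident prediction is penalized under the soft targets,
# while a prediction that matches the soft targets attains the minimum loss.
over_confident = [0.001, 0.997, 0.001, 0.001]
loss_over_confident = cross_entropy(soft, over_confident)
loss_matched = cross_entropy(soft, soft)
```

With hard targets the loss keeps decreasing as the predicted probability of the correct class approaches 1. With the smoothed targets the minimum sits at the soft distribution itself, which is one way to see why the network is discouraged from becoming over-confident.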
Taxonomy
Topics: Time Series Analysis and Forecasting · Machine Learning and Data Classification · Music and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation · Label Smoothing
