Towards Understanding Label Smoothing
Yi Xu, Yuanhong Xu, Qi Qian, Hao Li, Rong Jin

TL;DR
This paper provides a theoretical analysis of label smoothing regularization (LSR) in deep neural network training, showing how it accelerates convergence by reducing variance and proposing a two-stage LSR strategy that improves training efficiency.
Contribution
It offers the first convergence analysis of LSR in non-convex optimization and introduces a two-stage LSR method that enhances training speed and effectiveness.
Findings
LSR reduces variance and speeds up convergence in stochastic gradient descent.
Two-stage LSR (TSLA) improves training efficiency by applying LSR early and dropping it later.
Empirical results show TSLA outperforms baseline methods on ResNet training.
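The two-stage idea in the findings above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard form of label smoothing (mixing the one-hot label with a uniform distribution over the classes), and the names `smooth_labels`, `two_stage_theta`, `cross_entropy`, `theta`, and `switch_epoch` are hypothetical, chosen only for this example.

```python
import numpy as np


def smooth_labels(one_hot, theta):
    """Mix one-hot labels with the uniform distribution over K classes.

    Assumes the standard form (1 - theta) * y + theta / K.
    """
    num_classes = one_hot.shape[-1]
    return (1.0 - theta) * one_hot + theta / num_classes


def two_stage_theta(epoch, switch_epoch, theta):
    """Hypothetical schedule: smooth before switch_epoch, plain labels after."""
    return theta if epoch < switch_epoch else 0.0


def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and (possibly soft) targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.eye(4)[[0, 2, 1]]        # three samples, four classes
    logits = rng.normal(size=(3, 4))     # stand-in for model outputs
    for epoch in (0, 5, 10, 15):
        theta_t = two_stage_theta(epoch, switch_epoch=10, theta=0.1)
        loss = cross_entropy(logits, smooth_labels(labels, theta_t))
        print(epoch, theta_t, round(loss, 4))
```

In a real training loop the smoothed targets would feed the loss used for each SGD step; the switch epoch and smoothing strength here are placeholder values, not settings reported in the paper.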
Abstract
Label smoothing regularization (LSR) has had great success in training deep neural networks with stochastic algorithms such as stochastic gradient descent and its variants. However, theoretical understanding of its power from the viewpoint of optimization remains limited. This study opens the door to a deeper understanding of LSR by initiating that analysis. In this paper, we analyze the convergence behavior of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can help speed up convergence by reducing the variance. More interestingly, we propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing algorithm (TSLA), that uses LSR in the early training epochs and drops it in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · 1x1 Convolution · Bottleneck Residual Block · Batch Normalization · Average Pooling · Max Pooling · Global Average Pooling · Residual Connection · Label Smoothing
