Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation
Binghui Chen, Weihong Deng, Junping Du

TL;DR
This paper introduces Noisy Softmax, a method that injects annealed noise into the softmax function during training to delay saturation, enhance exploration, and improve CNN generalization.
Contribution
It proposes a novel noise injection technique in softmax to mitigate early saturation, promoting better exploration and generalization in CNN training.
Findings
Improves CNN generalization on benchmark datasets.
Achieves state-of-the-art or competitive results.
Enhances exploration during training by delaying softmax saturation.
Abstract
Over the past few years, softmax and SGD have become a commonly used component and the default training strategy in CNN frameworks, respectively. However, when optimizing CNNs with SGD, the saturation behavior of softmax gives the illusion that training is going well and is then overlooked. In this paper, we first emphasize that the early saturation of softmax impedes the exploration of SGD, which is sometimes a reason the model converges to a bad local minimum. We then propose Noisy Softmax, which mitigates this early saturation by injecting annealed noise into softmax during each iteration. This noise injection aims to postpone the early saturation and keep gradients propagating, so as to encourage the SGD solver to be more exploratory and help it find a better local minimum. This paper empirically verifies the superiority…
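The saturation effect described in the abstract can be illustrated with a small NumPy sketch. It relies on the standard fact that the gradient of softmax cross-entropy with respect to the logits is `p - y`, so the gradient vanishes once the true-class probability approaches 1. The noise form used here (a non-negative value subtracted from the true-class logit) and the names `noisy_softmax`, `noise_scale`, and `ce_grad` are assumptions made for illustration. This summary does not give the paper's exact noise formula or annealing schedule, so the caller is simply expected to pass whatever scale it wants for the current iteration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def noisy_softmax(logits, label, noise_scale, rng):
    """Softmax with non-negative noise subtracted from the true-class logit.

    Assumption: noise = noise_scale * |xi|, xi ~ N(0, 1). This is a
    hypothetical stand-in for the paper's annealed noise, not its formula.
    """
    noisy = np.asarray(logits, dtype=float).copy()
    noisy[label] -= noise_scale * abs(rng.standard_normal())
    return softmax(noisy)


def ce_grad(probs, label):
    """Gradient of cross-entropy w.r.t. the logits: p - one_hot(label)."""
    grad = probs.copy()
    grad[label] -= 1.0
    return grad


rng = np.random.default_rng(0)
logits = np.array([10.0, 0.0, 0.0])   # confidently correct, i.e. saturated
label = 0

clean_p = softmax(logits)
noisy_p = noisy_softmax(logits, label, noise_scale=5.0, rng=rng)

# The saturated clean gradient is close to zero; the noisy one is no smaller.
print("clean gradient norm:", np.linalg.norm(ce_grad(clean_p, label)))
print("noisy gradient norm:", np.linalg.norm(ce_grad(noisy_p, label)))
```

Because the noise only lowers the true-class logit, the true-class probability can never rise above its clean value, so the training-time gradient is never smaller than in the saturated case. At inference one would use the plain `softmax`, which is equivalent to setting `noise_scale` to 0.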
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Softmax · Stochastic Gradient Descent
