SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam
Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Wen Gao

TL;DR
This paper introduces SignSoftSGD (S3), a new optimizer that enhances Adam by reducing loss spikes and accelerating convergence through a flexible momentum scheme, leading to faster and more stable deep neural network training.
Contribution
S3 generalizes sign-like updates with a $p$-th order momentum, minimizes loss spikes with unified EMA coefficients, and incorporates Nesterov acceleration, offering theoretical convergence guarantees and practical improvements.
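The three ingredients named above can be sketched as a single elementwise update. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name `s3_like_step`, the default values of `lr`, `beta`, `p` and `eps`, and the exact form of the Nesterov-style lookahead are all hypothetical choices made for the example.

```python
def s3_like_step(w, g, m, v, lr=1e-3, beta=0.9, p=3.0, eps=1e-8):
    """One elementwise step of an S3-style update (illustrative sketch).

    w, g, m, v are equal-length lists of floats: parameters, gradients,
    gradient EMA, and EMA of |g|**p. All defaults are assumptions.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        gp = abs(gi) ** p
        mi = beta * mi + (1.0 - beta) * gi   # first-order momentum
        vi = beta * vi + (1.0 - beta) * gp   # p-th order momentum, same beta
        # Assumed Nesterov-style lookahead, applied to both EMAs so that
        # numerator and denominator keep identical averaging weights.
        num = beta * mi + (1.0 - beta) * gi
        den = (beta * vi + (1.0 - beta) * gp) ** (1.0 / p)
        new_w.append(wi - lr * num / (den + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


# Two coordinates whose gradients differ by a factor of 200.
w, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = s3_like_step(w, [0.5, -100.0], m, v, lr=0.1)
print(w)
```

In this sketch both averages share the same weights, so with zero-initialized `m` and `v` and `p >= 1`, Hölder's inequality gives `|num| <= den`. Each coordinate therefore moves by at most `lr` per step, whatever the gradient scale, which is one plausible reading of how a unified EMA coefficient limits spikes.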
Findings
S3 converges faster and more stably than Adam and AdamW.
S3 reduces loss spikes even with larger learning rates.
S3 achieves comparable or better performance with fewer training steps.
Abstract
Adam has proven remarkably successful in training deep neural networks, but the mechanisms underlying its empirical successes and limitations remain underexplored. In this study, we demonstrate that the effectiveness of Adam stems largely from its similarity to SignSGD in robustly handling large gradient fluctuations, yet it is also vulnerable to destabilizing loss spikes due to its uncontrolled update scaling. To enhance the advantage of Adam and mitigate its limitation, we propose SignSoftSGD (S3), a novel optimizer with three key innovations. First, S3 generalizes the sign-like update by employing a flexible $p$-th order momentum in the denominator, departing from the conventional second-order momentum (variance) preconditioning. This design enables enhanced performance while achieving stable training even with aggressive learning rates. Second, S3…
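The two claims about Adam, that it behaves like SignSGD and that its update scale is not controlled, can be illustrated on a single scalar parameter. The sketch below uses the standard bias-corrected Adam moment estimates. The function name `adam_update_ratio` and the toy gradient sequences are assumptions made for the example, and it is not an experiment from the paper.

```python
import math


def adam_update_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-12):
    """Return Adam's bias-corrected m_hat / sqrt(v_hat) after the last gradient."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # EMA of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # EMA of squared gradients
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)


steady = [1e-3] * 1000      # constant small gradient
spike = steady + [1.0]      # same history, then one large gradient

print(adam_update_ratio(steady))                       # close to 1: sign-like step
print(adam_update_ratio(spike))                        # larger than 1 with the usual betas
print(adam_update_ratio(spike, beta1=0.9, beta2=0.9))  # stays below 1 with a shared beta
```

With `beta2 = 0.999` the denominator reacts far more slowly than the numerator, so a single large gradient produces a step more than twice the learning rate. With a shared coefficient the two averages use the same weights, and Jensen's inequality keeps the ratio at or below 1.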
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
