SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam

Hanyang Peng; Shuang Qin; Yue Yu; Fangqing Jiang; Hui Wang; Wen Gao

arXiv:2507.06464·cs.LG·July 10, 2025

TL;DR

This paper introduces SignSoftSGD (S3), a new optimizer that enhances Adam by reducing loss spikes and accelerating convergence through a flexible momentum scheme, leading to faster and more stable deep neural network training.

Contribution

S3 generalizes sign-like updates with a $p$-th order momentum, minimizes loss spikes with unified EMA coefficients, and incorporates Nesterov acceleration, offering theoretical convergence guarantees and practical improvements.
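As a rough illustration of the first two ideas, the sketch below compares a sign-like update that uses one shared EMA coefficient for both the numerator and a $p$-th order denominator against a standard bias-corrected Adam update. The update form, the function names (`s3_like_update`, `adam_update`, `max_update_magnitudes`), and the values `beta=0.9` and `p=3.0` are assumptions made for illustration; this is not the paper's exact algorithm, and Nesterov acceleration is left out.

```python
import numpy as np

def s3_like_update(grad, m, s, beta=0.9, p=3.0, eps=1e-8):
    """Assumed sign-like update: one shared EMA coefficient, p-th order denominator."""
    m = beta * m + (1.0 - beta) * grad               # EMA of gradients
    s = beta * s + (1.0 - beta) * np.abs(grad) ** p  # EMA of |gradient|^p, same beta
    return m / (s ** (1.0 / p) + eps), m, s

def adam_update(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update direction with bias correction (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def max_update_magnitudes(grads, p=3.0):
    """Largest per-step update magnitude of each rule on one scalar gradient stream."""
    m_s = s = m_a = v = 0.0
    peak_s3 = peak_adam = 0.0
    for t, g in enumerate(grads, start=1):
        u_s3, m_s, s = s3_like_update(g, m_s, s, p=p)
        u_adam, m_a, v = adam_update(g, m_a, v, t)
        peak_s3 = max(peak_s3, abs(u_s3))
        peak_adam = max(peak_adam, abs(u_adam))
    return peak_s3, peak_adam

# A long run of tiny gradients followed by one large gradient (a synthetic spike).
grads = np.concatenate([np.full(2000, 1e-3), [1.0]])
peak_s3, peak_adam = max_update_magnitudes(grads)
print(f"shared-beta p-th order update peak: {peak_s3:.3f}")
print(f"Adam update peak:                   {peak_adam:.3f}")
```

On this synthetic stream the shared-coefficient update stays at or below 1 in magnitude, while Adam's update exceeds 1 at the spike because its slow second-moment average has not yet caught up with the sudden large gradient.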

Findings

01. S3 converges faster and more stably than Adam and AdamW.

02. S3 reduces loss spikes even with larger learning rates.

03. S3 achieves comparable or better performance with fewer training steps.

Abstract

Adam has proven remarkably successful in training deep neural networks, but the mechanisms underlying its empirical successes and limitations remain underexplored. In this study, we demonstrate that the effectiveness of Adam stems largely from its similarity to SignSGD in robustly handling large gradient fluctuations, yet it is also vulnerable to destabilizing loss spikes due to its uncontrolled update scaling. To enhance the advantage of Adam and mitigate its limitation, we propose SignSoftSGD (S3), a novel optimizer with three key innovations. First, S3 generalizes the sign-like update by employing a flexible $p$-th order momentum ($p \geq 1$) in the denominator, departing from the conventional second-order momentum (variance) preconditioning. This design enables enhanced performance while achieving stable training even with aggressive learning rates. Second, S3…
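One way to see why a $p$-th order denominator can keep the update scale under control is the short bound below. It is a sketch under stated assumptions rather than the paper's own analysis: the numerator and denominator are taken to be zero-initialized exponential moving averages that share a single coefficient $\beta \in [0, 1)$, and the symbols $m_t$, $s_t$, and $w_i$ are notation introduced here.

```latex
% Assumed forms (not taken from the paper): shared coefficient \beta, zero initialization.
m_t = \sum_{i=1}^{t} w_i \, g_i, \qquad
s_t = \sum_{i=1}^{t} w_i \, |g_i|^p, \qquad
w_i = (1-\beta)\,\beta^{\,t-i}, \qquad
\sum_{i=1}^{t} w_i = 1 - \beta^{t} \le 1 .

% Triangle inequality, then H\"older's inequality with exponents p/(p-1) and p (p \ge 1):
|m_t| \;\le\; \sum_{i=1}^{t} w_i \, |g_i|
      \;\le\; \Big(\sum_{i=1}^{t} w_i\Big)^{1 - 1/p}
              \Big(\sum_{i=1}^{t} w_i \, |g_i|^p\Big)^{1/p}
      \;\le\; s_t^{1/p} .
```

Under these assumptions each coordinate of $m_t / s_t^{1/p}$ lies in $[-1, 1]$ for every $p \geq 1$. The argument relies on both averages using the same weights $w_i$; with two different coefficients, as in Adam's usual settings, this particular bound no longer applies.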

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis