StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
Dingzhi Yu, Rui Pan, Yuxing Liu, Tong Zhang

TL;DR
StoSignSGD introduces structural stochasticity into SignSGD, fixing its divergence issues on non-smooth objectives and demonstrating superior stability and efficiency in training large language models, especially in low-precision settings.
Contribution
The paper proposes StoSignSGD, a novel unbiased sign-based optimizer that resolves SignSGD's divergence on non-smooth objectives and improves convergence in large language model training.
Findings
In (online) convex optimization, StoSignSGD achieves a convergence rate that matches the lower bound.
In low-precision FP8 pretraining, StoSignSGD outperforms AdamW and SignSGD in stability and speed.
StoSignSGD improves fine-tuning performance on mathematical reasoning tasks for 7B LLMs.
Abstract
Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite this empirical success, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass…
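To make the phrase "an unbiased update step" concrete, the following is a minimal sketch of one well-known way to randomize a sign operator so that its expectation is proportional to the input. It is not taken from the paper and should not be read as the StoSignSGD update rule: the function names `stochastic_sign` and `estimate_mean`, the scale parameter `bound`, and the assumption that every coordinate satisfies |g_i| <= bound are all illustrative choices made here.

```python
import numpy as np


def stochastic_sign(g, bound, rng):
    """Randomized sign (illustrative, not the paper's operator).

    Each coordinate is +1 with probability (1 + g_i / bound) / 2 and -1
    otherwise. Assuming |g_i| <= bound, the expectation is g_i / bound,
    so the operator is unbiased up to the known scale `bound`.
    """
    g = np.asarray(g, dtype=float)
    p_plus = 0.5 * (1.0 + np.clip(g / bound, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)


def estimate_mean(g, bound, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[stochastic_sign(g)] per coordinate."""
    rng = np.random.default_rng(seed)
    g = np.asarray(g, dtype=float)
    tiled = np.broadcast_to(g, (n_samples,) + g.shape)
    return stochastic_sign(tiled, bound, rng).mean(axis=0)


g = np.array([0.8, -0.1, 0.0, -0.5])  # hypothetical gradient vector
bound = 1.0                           # assumed coordinate-wise bound on |g_i|

print("deterministic sign:     ", np.sign(g))
print("mean of stochastic sign:", estimate_mean(g, bound))
print("target g / bound:       ", g / bound)
```

The deterministic sign discards all magnitude information, whereas the average of the randomized sign approaches g / bound as the number of samples grows. That contrast is the general idea behind unbiased sign-type operators. The specific structural stochasticity used by StoSignSGD is defined in the paper itself.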
