StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

Dingzhi Yu; Rui Pan; Yuxing Liu; Tong Zhang

arXiv:2604.15416·cs.LG·April 20, 2026


TL;DR

StoSignSGD introduces structural stochasticity into SignSGD, fixing its divergence issues on non-smooth objectives and demonstrating superior stability and efficiency in training large language models, especially in low-precision settings.

Contribution

The paper proposes StoSignSGD, a novel unbiased sign-based optimizer that resolves SignSGD's divergence on non-smooth objectives and improves convergence in large language model training.

Findings

01

In (online) convex optimization, StoSignSGD achieves a convergence rate that matches the lower bound.

02

In low-precision FP8 pretraining, StoSignSGD outperforms AdamW and SignSGD in stability and speed.

03

StoSignSGD improves fine-tuning performance on mathematical reasoning tasks for 7B LLMs.

Abstract

Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite their empirical superiority, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass…
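To make the idea of an unbiased randomized sign concrete, here is a minimal sketch of one standard construction. It is not taken from the paper, whose exact operator is not given in the abstract. The sketch assumes each gradient coordinate is bounded in magnitude by a known constant; the names `stochastic_sign` and `bound` are hypothetical and chosen for illustration only. Each coordinate is mapped to `+bound` or `-bound` with probabilities chosen so that the expectation equals the original coordinate.

```python
import numpy as np

def stochastic_sign(g, bound, rng):
    """Randomized sign with values in {-bound, +bound} and expectation g.

    Illustrative assumption (not the paper's operator): |g_i| <= bound.
    Coordinate i becomes +bound with probability (1 + g_i / bound) / 2,
    so E[output_i] = bound * (2 * p_i - 1) = g_i.
    """
    g = np.asarray(g, dtype=float)
    p_plus = 0.5 * (1.0 + np.clip(g / bound, -1.0, 1.0))
    signs = np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)
    return bound * signs

rng = np.random.default_rng(0)
g = np.array([0.5, -0.2, 0.0, 0.9])
bound = 1.0

# Draw many independent randomized signs of the same vector and average them.
samples = stochastic_sign(np.tile(g, (200_000, 1)), bound, rng)
empirical_mean = samples.mean(axis=0)

# The deterministic sign discards magnitude information entirely.
deterministic = np.sign(g)
```

In this sketch the empirical mean of the randomized signs approaches `g`, whereas `np.sign(g)` keeps only the direction of each coordinate, which is the bias that an unbiased sign operator is meant to remove.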
