When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
Hongyi Tao, Dingzhi Yu, Lijun Zhang

TL;DR
This paper provides a theoretical analysis explaining when and why SignSGD outperforms SGD, especially under sparse noise conditions, supported by empirical results on training a GPT-2 model.
Contribution
It introduces a novel theoretical framework based on $\ell_1$-norm analysis, characterizes the problem class where SignSGD excels, and extends the analysis to matrix optimizers like Muon.
Findings
SignSGD reduces complexity by a factor of $d$ under sparse noise.
Theoretical bounds agree with the empirically observed faster convergence of SignSGD.
Extending sign operators to matrices preserves optimal scaling with dimensionality.
Abstract
Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD…
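For readers unfamiliar with the update rules being compared, the following is a minimal sketch of one SGD step next to one SignSGD step, plus the two gradient norms used to measure stationarity. It is illustrative only and not taken from the paper: the function names, the step size `lr`, and the example vectors are assumptions.

```python
import numpy as np

def sgd_step(x, grad, lr):
    # Vanilla SGD: move against the stochastic gradient, scaled by its magnitude.
    return x - lr * grad

def signsgd_step(x, grad, lr):
    # SignSGD: move every coordinate by the same amount lr, using only the sign
    # of each gradient entry (np.sign returns 0 for a zero entry).
    return x - lr * np.sign(grad)

# Hypothetical iterate and stochastic gradient, chosen only for illustration.
x = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -4.0, 0.0])
lr = 0.1

x_sgd = sgd_step(x, g, lr)
x_sign = signsgd_step(x, g, lr)

# The two stationarity measures: the l1 norm can exceed the l2 norm
# by up to a factor of sqrt(d), where d is the dimension.
l1_norm = np.linalg.norm(g, 1)
l2_norm = np.linalg.norm(g, 2)
```

In this sketch the SignSGD step moves each nonzero-gradient coordinate by exactly `lr`, regardless of how large that gradient entry is, which is the coordinate-wise behaviour the abstract refers to.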
