Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise
Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

TL;DR
This paper provides a theoretical framework explaining, through the lens of heavy-tailed gradient noise, why sign-based optimizers such as Lion and Muon outperform AdamW in training large language models.
Contribution
It introduces a generalized heavy-tailed noise model and establishes convergence rates for sign-based optimizers, bridging the gap between theory and empirical success.
Findings
Sign-based optimizers outperform variance-adapted methods in heavy-tailed noise settings.
Theoretical convergence rates match or surpass previous bounds under the new noise model.
Empirical LLM pretraining experiments support the theoretical insights.
Abstract
While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLMs). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we…
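For readers unfamiliar with the two optimizers named in the abstract, the following is a minimal NumPy sketch of their commonly published update rules. It is not taken from the paper: the function names, default hyperparameters, and the decoupled weight-decay form are assumptions made for illustration, and the paper's exact variants may differ. The point it illustrates is that each coordinate moves by at most the learning rate, however large a single noisy gradient entry is.

```python
import numpy as np

def sign_sgd_step(x, grad, lr):
    """One SignSGD step: move each coordinate by lr against the gradient's sign.

    The step size does not depend on the gradient's magnitude, so a single
    extreme (heavy-tailed) gradient sample cannot produce an extreme update.
    """
    return x - lr * np.sign(grad)

def lion_step(x, m, grad, lr, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step in its commonly published form (assumed, not from the paper).

    x    : parameters
    m    : momentum buffer (same shape as x)
    grad : stochastic gradient at x
    Returns the updated (x, m).
    """
    # Interpolate momentum and current gradient, then keep only the sign.
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Decoupled weight decay is applied alongside the sign update.
    x_new = x - lr * (update + wd * x)
    # The momentum buffer is refreshed with a second interpolation factor.
    m_new = beta2 * m + (1.0 - beta2) * grad
    return x_new, m_new

# Illustrative values only: one coordinate has a huge gradient, one a tiny one.
x = np.array([1.0, -2.0, 3.0])
g = np.array([1e6, -1e-6, 5.0])

x_sign = sign_sgd_step(x, g, lr=0.1)
x_lion, m_lion = lion_step(x, np.zeros_like(x), g, lr=0.1)
```

In this toy example every coordinate of `x` moves by exactly `0.1` under both rules, whereas a plain SGD step with the same learning rate would move the first coordinate by `1e5`.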
