TL;DR
This paper provides the first theoretical convergence analysis of adaptive optimizers like Adam and Muon under floating-point quantization, explaining their effectiveness in low-precision training of large language models.
Contribution
It introduces a novel framework for analyzing adaptive optimizers with quantized components, deriving convergence rates and robustness insights under hardware-aware low-precision settings.
Findings
Adam is sensitive to weight and second-moment quantization due to its reliance on $\beta_2 \to 1$
Muon is more robust to quantization errors, requiring weaker error control
Quantization errors scale logarithmically with the number of iterations
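The first finding can be made concrete with a toy experiment. The sketch below is an illustration under stated assumptions, not the paper's analysis: `quantize_fp` is a hypothetical round-to-nearest quantizer that keeps a fixed number of explicit mantissa bits and ignores exponent range, and the recursion $v \leftarrow \beta_2 v + (1-\beta_2) g^2$ is the standard Adam second-moment update, here fed a constant squared gradient. When $\beta_2$ is close to 1, each increment can fall below half the spacing of the low-precision grid and be rounded away.

```python
import math


def quantize_fp(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value with `mantissa_bits` explicit mantissa bits.

    Hypothetical helper for illustration: exponent range is unlimited, so
    overflow and underflow of a real low-precision format are not modeled.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)     # implicit leading bit + explicit bits
    return math.ldexp(round(m * scale) / scale, e)


def run_second_moment(beta2: float, mantissa_bits: int,
                      steps: int = 200, g_sq: float = 4.0, v0: float = 1.0) -> float:
    """Iterate v <- Q(beta2 * v + (1 - beta2) * g_sq) and return the final v.

    The constant g_sq and the starting point v0 are assumptions of this toy.
    """
    v = v0
    for _ in range(steps):
        v = quantize_fp(beta2 * v + (1.0 - beta2) * g_sq, mantissa_bits)
    return v


if __name__ == "__main__":
    for beta2 in (0.9, 0.999):
        for bits in (7, 52):               # 7 explicit bits as in bfloat16; 52 as in float64
            print(f"beta2={beta2}, mantissa_bits={bits}: v={run_second_moment(beta2, bits)}")
```

With 7 mantissa bits and $\beta_2 = 0.999$, the first update would move $v$ from 1.0 to 1.003, but the grid spacing just above 1 is $2^{-7} \approx 0.0078$, so the result rounds back to 1.0 and the iterate never moves. With $\beta_2 = 0.9$ the same quantizer lets $v$ approach $g^2 = 4$.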
Abstract
The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their…
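To make the setting in the abstract concrete, the following is a minimal sketch of one way to place quantizers in an Adam step. It is an illustration built on assumptions rather than the paper's algorithm: the quantizer `quantize_fp`, the choice to quantize the gradient, both moments, and the weight after every update, the bias correction kept in full precision, and the 1-D quadratic objective are all hypothetical choices made for this example.

```python
import math


def quantize_fp(x: float, mantissa_bits: int) -> float:
    """Hypothetical round-to-nearest quantizer with unlimited exponent range."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)
    return math.ldexp(round(m * scale) / scale, e)


def quantized_adam(grad_fn, w0: float, steps: int, lr: float = 1e-2,
                   beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
                   bits_g: int = 10, bits_w: int = 10,
                   bits_m: int = 10, bits_v: int = 10) -> float:
    """Scalar Adam where gradient, weight, and both moments are quantized.

    The placement of the quantizers is an assumption of this sketch.
    """
    w = quantize_fp(w0, bits_w)
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = quantize_fp(grad_fn(w), bits_g)                            # quantized gradient
        m = quantize_fp(beta1 * m + (1.0 - beta1) * g, bits_m)         # quantized first moment
        v = quantize_fp(beta2 * v + (1.0 - beta2) * g * g, bits_v)     # quantized second moment
        m_hat = m / (1.0 - beta1 ** t)                                 # bias correction, full precision
        v_hat = v / (1.0 - beta2 ** t)
        w = quantize_fp(w - lr * m_hat / (math.sqrt(v_hat) + eps), bits_w)  # quantized weight
    return w


if __name__ == "__main__":
    grad = lambda w: w                     # gradient of the toy objective f(w) = 0.5 * w**2
    w_low = quantized_adam(grad, w0=1.0, steps=300)
    w_ref = quantized_adam(grad, w0=1.0, steps=300,
                           bits_g=52, bits_w=52, bits_m=52, bits_v=52)
    print("low precision:", w_low, "reference:", w_ref)
```

Varying `bits_v` or `beta2` independently of the other arguments gives a rough way to probe which component is most sensitive in this toy problem; it does not reproduce the paper's bounds.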
Peer Reviews
Decision·ICLR 2026 Poster
complete quantization error analysis under certain settings, both for Adam and Muon
The experiments are limited, and the theory is not very informative about practicality either. The novelty and contribution are limited. - line 402 "the second moment (qV) is stricter than for the first moment (qM)" -> this is a well-known fact from existing work, e.g., https://arxiv.org/abs/2405.03637 and https://arxiv.org/abs/2405.03637 - many missing connections with stochastic rounding work, which gives an unbiased estimate but brings higher variance. In modern low bit training like 4 bits, S
**Rigorous analysis and theoretical insights that align with recent practice.** Providing clear statements (Th. 4.5 and Th. 4.6), the work explains why Muon tolerates quantization better than Adam --- mostly due to an important assumption of $\beta_2\to1$ in the Adam analysis. This theoretical insight matches practitioners’ observations [1], narrowing the theory–empirical gap. **Empirical validation confirms theory.** Experiments on the Rosenbrock function and small fully connected models co
**Missing research on convergence of matrix-based optimizers, leaving room for improvement.** Unlike Kovalev et al. [2], who handle constrained/composite and star-convex settings, or Shen et al. [3], who exploit Hessian structure in several assumptions, the presented theory covers only unconstrained smooth non-convex functions. Also, a discussion of the results on constrained / unconstrained LMO optimization [4] --- resulting in the Scion optimizer --- would benefit the theoretical flavor of
- The paper extends prior work with a rigorous convergence analysis of adaptive optimizers under quantization of gradients, weights, and momentum terms. - It proposes a quantization schedule that aligns the behavior of quantized optimizers with their full-precision counterparts, offering insights into the sensitivity of different components to quantization error. - The inclusion of the Muon optimizer broadens the analysis and enhances the paper’s practical relevance.
The paper omits key references and makes inaccurate claims about prior work. For instance, - [Hou et al. 2019] analyzed not only SGD but also adaptive optimizers such as Adam under weight and gradient quantization (without the first-order momentum term, i.e., $\beta_1=0$). As shown by [Défossez et al. 2022], omitting momentum only introduces a multiplicative slowdown term, which should be acknowledged unless the new quantization error model changes this relationship. - Another closely relate
