Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?
Jihwan Kim, Dogyoon Song, Chulhee Yun

TL;DR
This paper analyzes the scaling behavior of signSGD in linear regression under a power-law random features model and identifies regimes where signSGD's compute-optimal scaling beats SGD's, driven by a noise-reshaping effect and reinforced by a warmup-stable-decay learning-rate schedule.
Contribution
The paper derives the population risk of one-pass signSGD for a linear model with power-law feature and target decay, and uses it to characterize when signSGD surpasses SGD.
Findings
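SignSGD's noise-reshaping effect can make its compute-optimal slope steeper than SGD's in noise-dominated regimes.
A warmup-stable-decay (WSD) schedule further reduces the noise term, improving signSGD in certain regimes.
The analysis provides compute-optimal scaling laws for signSGD under the optimal choice of learning rate.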
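For readers unfamiliar with the warmup-stable-decay shape mentioned above, here is a minimal sketch. The function name `wsd_lr`, the linear warmup, and the linear decay are assumptions made for illustration; this summary does not give the exact schedule used in the paper.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, final_lr=0.0):
    """Hypothetical warmup-stable-decay learning rate at a 0-indexed step.

    Assumed shape (not taken from the paper):
      1. linear warmup to peak_lr over warmup_steps,
      2. constant peak_lr during the stable phase,
      3. linear decay from peak_lr toward final_lr over the last decay_steps.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup phase: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr toward final_lr.
    frac = (step - decay_start) / max(decay_steps, 1)
    return peak_lr + (final_lr - peak_lr) * frac


if __name__ == "__main__":
    schedule = [wsd_lr(t, 100, 1e-2, warmup_steps=10, decay_steps=20) for t in range(100)]
    print(schedule[0], schedule[50], schedule[99])
```

Other decay shapes (cosine, exponential) fit the same three-phase template; only the final branch would change.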
Abstract
We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the…
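The setup in the abstract can be sketched in a few lines of NumPy. This is an illustration only, under assumptions not stated in this summary: feature coordinate j has variance j^(-2α), target coefficient j has size j^(-β), the sketch matrix `W` has i.i.d. N(0, 1/d) entries, and batch size is 1. The paper's exact normalization may differ, and all function names and default values here are hypothetical.

```python
import numpy as np


def make_plrf(v, d, alpha, beta, rng):
    """Build an assumed PLRF instance with ambient dimension v and model size d."""
    j = np.arange(1, v + 1, dtype=float)
    feat_std = j ** (-alpha)                  # assumed: x_j ~ N(0, j^(-2*alpha))
    b = j ** (-beta)                          # assumed: target y = <x, b>
    W = rng.normal(size=(v, d)) / np.sqrt(d)  # Gaussian sketch, entries N(0, 1/d)
    return feat_std, b, W


def population_risk(theta, feat_std, b, W):
    """E[(<W^T x, theta> - <x, b>)^2] for independent zero-mean coordinates of x."""
    resid = W @ theta - b
    return float(np.sum((feat_std * resid) ** 2))


def sgd_direction(grad):
    return grad


def sign_direction(grad):
    # signSGD keeps only the coordinate-wise sign of the stochastic gradient.
    return np.sign(grad)


def run_one_pass(direction, v=400, d=50, alpha=1.0, beta=0.5,
                 steps=2000, lr=1e-2, seed=0):
    """One-pass training: every step draws a fresh sample (batch size 1)."""
    rng = np.random.default_rng(seed)
    feat_std, b, W = make_plrf(v, d, alpha, beta, rng)
    theta = np.zeros(d)
    risks = [population_risk(theta, feat_std, b, W)]
    for _ in range(steps):
        x = feat_std * rng.normal(size=v)     # fresh power-law feature vector
        z = W.T @ x                           # sketched features seen by the model
        err = z @ theta - x @ b               # prediction error on this sample
        grad = err * z                        # gradient of 0.5 * err**2 w.r.t. theta
        theta = theta - lr * direction(grad)
        risks.append(population_risk(theta, feat_std, b, W))
    return np.array(risks)


if __name__ == "__main__":
    for name, direction in [("sgd", sgd_direction), ("signsgd", sign_direction)]:
        risks = run_one_pass(direction)
        print(name, risks[0], risks[-1])
```

Because the abstract's comparison is made under the optimal learning rate for each method, a meaningful experiment would sweep `lr` (and `d`, `steps`) separately for each optimizer rather than reuse the single default shown here.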
Peer Reviews
Decision: ICLR 2026 Poster
The paper derives the scaling exponent for SignSGD, a simplified variant of Adam, and empirically demonstrates that the theoretical predictions also hold for Adam.
- The technical part of the paper closely follows the ideas in Xiao et al. (2024) and Paquette et al. (2024).
- As far as I understand, the results in the paper depend on choosing a good learning rate, where the optimal rate depends on both the target and feature exponents. However, the authors do not discuss how to tune the learning rate. Discussing this point, and showing how sensitive the results are to suboptimal choices, would strengthen the paper.
* The theoretical argument is rigorous and sound to me. The four-term risk formula decomposition mirrors prior SGD analyses and uncovers signSGD-specific properties in a clear way. This gives immediate intuition for when signSGD can help.
* The paper provides further insights, including the comprehensive phase transitions of scaling laws, analysis for compute-optimal slope, and learning rate scheduling that reflects the practice. These analyses serve as a step towards understanding this optimize
* **Limitations in Setting:** There are certain assumptions in this paper, for example, batch size = 1, diagonal covariance, one-pass training, and Gaussian sketching in the PLRF model, that limit direct generalization to more practical pipelines. The paper acknowledges some limitations such as mini-batching and momentum/Adam invariants, but these still limit the novelty to some extent, and results would be stronger with at least partial extensions (or tests) beyond this PLRF setting.
* **Discu
1. Extending scaling-law theory to optimizers beyond SGD is a natural and important step, especially given that sign-based methods underlie Adam and its variants, which dominate large-scale model training. The derivation of scaling exponents for signSGD is nontrivial, and the introduction of drift-normalization and noise-reshaping as analytic effects is both insightful and original.
2. The analysis connects optimizer dynamics (sign normalization, step-size scheduling) to compute-optimal scaling
1. The paper’s presentation is algebra-heavy and difficult to parse. While mathematically correct, it often lacks guiding intuition or concrete interpretation of the results in the context of real neural network training. The drift-normalization and noise-reshaping effects, while named, are only partially explained mechanistically.
2. The experiments are limited to synthetic settings (Gaussian-sketch features) and are primarily confirmatory. No evidence is given that the derived scaling laws hold
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
