Towards Robust Scaling Laws for Optimizers

Alexandra Volkova; Mher Safaryan; Christoph H. Lampert; Dan Alistarh

arXiv:2602.07712 · cs.LG · February 25, 2026

TL;DR

This paper investigates how different optimizers affect the scaling laws of large language model training, proposing a unified law with shared exponents and analyzing the theoretical basis for these laws.

Contribution

The paper introduces a robust scaling law applicable across optimizers and provides a theoretical explanation for the emergence of scaling laws in gradient-based methods.

Findings

01 Shared exponents improve optimizer comparison

02 New scaling law fits empirical data well

03 Theoretical analysis explains law emergence

Abstract

The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow; however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) separate Chinchilla-style scaling laws for each optimizer are ill-conditioned and have highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enable direct comparison between optimizers.…
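To make the idea of shared exponents concrete, the following is a minimal sketch, not the paper's formulation. The Chinchilla-style form L(N, D) = E + A / N^alpha + B / D^beta is the standard one; the abstract does not state how the optimizer-specific rescaling factors enter, so this sketch assumes they multiply the model size N and token count D. The function names, the optimizer labels, and all numeric values are hypothetical placeholders, not fitted results.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style law: L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


def shared_exponent_loss(n_params, n_tokens, shared, rescale_n=1.0, rescale_d=1.0):
    """Assumed shared-exponent variant (hypothetical form).

    All optimizers use the same E, A, B, alpha, beta; each optimizer only
    contributes two factors that rescale the effective N and D.
    """
    return chinchilla_loss(rescale_n * n_params, rescale_d * n_tokens, **shared)


# Placeholder coefficients shared by every optimizer (illustrative only).
SHARED = dict(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)

# Hypothetical per-optimizer rescaling factors (rescale_n, rescale_d).
RESCALE = {
    "baseline": (1.0, 1.0),
    "optimizer_x": (1.3, 1.3),
}


def compare(n_params, n_tokens):
    """Predicted loss per optimizer at one (N, D) point."""
    return {
        name: shared_exponent_loss(n_params, n_tokens, SHARED, rn, rd)
        for name, (rn, rd) in RESCALE.items()
    }


if __name__ == "__main__":
    for name, loss in compare(1e8, 2e9).items():
        print(f"{name}: predicted loss {loss:.4f}")
```

Under this assumed form, a factor above 1 reads as the optimizer behaving like a proportionally larger model or dataset, so two optimizers can be compared through their factors alone rather than through separately fitted, correlated exponents.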

Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Natural Language Processing Techniques