Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Egor Shulgin; Dimitri von Rütte; Tianyue H. Zhang; Niccolò Ajroldi; Bernhard Schölkopf; Antonio Orvieto

arXiv:2603.15958·cs.LG·March 18, 2026

Open Access

TL;DR

This paper develops hyperparameter scaling laws for modern optimizers using convergence bounds, providing a unified, principled framework that recovers existing insights and suggests new strategies for optimal training performance.

Contribution

The paper derives hyperparameter scaling laws from convergence bounds, unifying and extending prior empirical and theoretical insights.

Findings

01. Closed-form power-law schedules for hyperparameters

02. Unified perspective on existing scaling laws

03. Insights into momentum and batch-size interactions

Abstract

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size…
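
The LMO framework named in the abstract can be illustrated with a minimal NumPy sketch. It relies on the standard fact that the Linear Minimization Oracle over a unit norm ball, argmin over ||d|| <= 1 of <g, d>, gives the normalized-SGD direction for the Euclidean norm, the signSGD direction for the l-infinity norm, and the orthogonalized direction associated with Muon for the spectral norm. The function names, the exact SVD (practical Muon implementations typically approximate it), and the momentum convention in `lmo_step` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def lmo_l2(g):
    """LMO over the unit Euclidean ball: -g / ||g|| (normalized SGD direction)."""
    norm = np.linalg.norm(g)
    return -g / norm if norm > 0 else np.zeros_like(g)

def lmo_linf(g):
    """LMO over the unit l-infinity ball: -sign(g) (signSGD direction)."""
    return -np.sign(g)

def lmo_spectral(G):
    """LMO over the unit spectral-norm ball: -U V^T from the reduced SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -U @ Vt

def lmo_step(x, g, m, lr, beta, lmo):
    """One step with an averaged momentum buffer (assumed convention):
    m <- beta * m + (1 - beta) * g, then x <- x + lr * lmo(m)."""
    m = beta * m + (1.0 - beta) * g
    return x + lr * lmo(m), m

# Usage: a single normalized-SGD step with no momentum (beta = 0).
x0 = np.zeros(2)
grad = np.array([3.0, -4.0])
x1, m1 = lmo_step(x0, grad, np.zeros(2), lr=0.1, beta=0.0, lmo=lmo_l2)
```

In this sketch the learning rate `lr`, the momentum `beta`, and the batch size used to compute `grad` are the hyperparameters the abstract refers to. The power-law schedules themselves are not reproduced here because the listing does not state them.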

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning in Materials Science · Stochastic Gradient Optimization Techniques