Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto

TL;DR
This paper derives hyperparameter scaling laws for modern optimizers by minimizing convergence bounds, giving a unified, principled framework that recovers existing insights and suggests new tuning strategies.
Contribution
The paper introduces an approach that derives hyperparameter scaling laws from convergence bounds, unifying and extending prior empirical and theoretical insights.
Findings
Closed-form power-law schedules for hyperparameters
Unified perspective on existing scaling laws
Insights into momentum and batch-size interactions
Abstract
Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size…
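To make the LMO framing in the abstract concrete, here is a minimal sketch of how the choice of norm ball yields the three update directions it mentions. It relies on the standard fact that the LMO over a unit norm ball, argmin over ||d|| <= 1 of <g, d>, is -g/||g|| for the Euclidean norm, -sign(g) for the l-infinity norm, and -U V^T (from the SVD g = U S V^T) for the spectral norm. The function name `lmo_direction` and the `norm` argument are hypothetical and are not taken from the paper. The exact SVD is used only for clarity; it is an assumption of this sketch, since practical Muon implementations approximate the orthogonal factor iteratively.

```python
import numpy as np

def lmo_direction(grad, norm="l2"):
    """Return argmin_{||d|| <= 1} <grad, d> for the chosen norm ball.

    Illustrative helper (hypothetical name, not from the paper).
    """
    if norm == "l2":
        # Euclidean (Frobenius) ball: normalized-SGD direction.
        n = np.linalg.norm(grad)
        return -grad / n if n > 0 else np.zeros_like(grad)
    if norm == "linf":
        # l-infinity ball: signSGD direction.
        return -np.sign(grad)
    if norm == "spectral":
        # Spectral-norm ball: Muon-style orthogonalized direction.
        # Exact SVD used here for clarity only.
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return -(u @ vt)
    raise ValueError(f"unknown norm: {norm}")

# One update step with a placeholder learning rate (assumed value).
weights = np.zeros((2, 2))
grad = np.array([[2.0, 0.0], [0.0, 0.5]])
lr = 0.1
weights = weights + lr * lmo_direction(grad, norm="spectral")
```

In this sketch every step has unit size in the chosen norm regardless of the gradient's magnitude, so the step length is set entirely by the learning rate.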
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning in Materials Science · Stochastic Gradient Optimization Techniques
