Scaling Exponents Across Parameterizations and Optimizers
Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington

TL;DR
This paper investigates how different parameterizations and optimizers affect model scaling, proposing new theoretical insights, a novel per-layer learning rate method, and a stable Adam variant that improves hyperparameter transfer and numerical stability.
Contribution
It introduces a broader theoretical framework for model scaling, a new per-layer learning rate prescription, and a scale-invariant Adam optimizer eliminating the epsilon hyperparameter.
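As a rough illustration of what a per-layer learning rate prescription looks like in practice, the sketch below scales a base learning rate by a power of the width ratio, with a separate exponent for each layer type. The function name `per_layer_learning_rates`, the layer names, and the exponent values are all hypothetical placeholders chosen for illustration; they are not the exponents derived in the paper.

```python
def per_layer_learning_rates(base_lr, width, base_width, exponents):
    """Return a learning rate per layer: base_lr * (width / base_width) ** (-c).

    `exponents` maps a layer name to its scaling exponent c. The mapping is
    supplied by the caller; nothing here encodes the paper's prescription.
    """
    ratio = width / base_width
    return {layer: base_lr * ratio ** (-c) for layer, c in exponents.items()}


# Placeholder exponents for illustration only (NOT values from the paper).
example_exponents = {"embedding": 0.0, "hidden": 1.0, "readout": 0.5}

rates = per_layer_learning_rates(
    base_lr=1e-2, width=4096, base_width=1024, exponents=example_exponents
)
for layer, lr in rates.items():
    print(f"{layer}: {lr:.6f}")
```

With this shape, a learning rate tuned at `base_width` can be reused at a larger width by changing only `width`, which is the sense in which such a prescription supports hyperparameter transfer.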
Findings
All parameterizations can achieve hyperparameter transfer.
The new per-layer learning rate prescription outperforms muP.
Adam-atan2 is a numerically stable, epsilon-free optimizer.
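To make the epsilon-free idea concrete, here is a minimal sketch of an Adam-style step in which the usual ratio m / (sqrt(v) + epsilon) is replaced by atan2(m, sqrt(v)). It is written in plain Python over lists of floats. The function name `adam_atan2_step` and the default hyperparameters are assumptions for illustration, and the paper's exact formulation may include additional scaling constants not shown here.

```python
import math


def adam_atan2_step(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One Adam-style update using atan2 in place of division by sqrt(v) + eps.

    params, grads, m, v are equal-length lists of floats; step starts at 1.
    Returns new (params, m, v) lists without mutating the inputs.
    """
    bias1 = 1.0 - beta1 ** step  # bias correction for the first moment
    bias2 = 1.0 - beta2 ** step  # bias correction for the second moment
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_i / bias1
        v_hat = v_i / bias2
        # atan2(0, 0) is defined as 0, so a zero gradient needs no epsilon guard.
        update = math.atan2(m_hat, math.sqrt(v_hat))
        new_params.append(p - lr * update)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


zeros = [0.0, 0.0, 0.0]
start = [1.0, 1.0, 1.0]
p_a, _, _ = adam_atan2_step(start, [0.5, -2.0, 0.0], zeros, zeros, step=1)
# Rescaling every gradient by the same positive factor leaves the step unchanged.
p_b, _, _ = adam_atan2_step(start, [0.5e-12, -2.0e-12, 0.0], zeros, zeros, step=1)
```

Because atan2(c·m, c·sqrt(v)) equals atan2(m, sqrt(v)) for any positive c, the update does not depend on the overall gradient scale, and the output is bounded in magnitude by π/2 per coordinate.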
Abstract
Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just…
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms
Methods: Sparse Evolutionary Training · Adam
