Scaling Exponents Across Parameterizations and Optimizers

Katie Everett; Lechao Xiao; Mitchell Wortsman; Alexander A. Alemi; Roman Novak; Peter J. Liu; Izzeddin Gur; Jascha Sohl-Dickstein; Leslie Pack Kaelbling; Jaehoon Lee; Jeffrey Pennington

arXiv:2407.05872 · cs.LG · July 17, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper investigates how parameterization and optimizer choices affect model scaling. It offers new theoretical results, a new per-layer learning rate prescription, and a numerically stable Adam variant, with the aim of improving hyperparameter transfer and numerical stability.

Contribution

The paper introduces a broader theoretical framework for model scaling, a new per-layer learning rate prescription, and a scale-invariant Adam optimizer that eliminates the epsilon hyperparameter.

Findings

01. All parameterizations can achieve hyperparameter transfer.
02. The new per-layer learning rate outperforms muP.
03. Adam-atan2 is a numerically stable, epsilon-free optimizer.
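
To make the epsilon-free idea concrete, here is a minimal scalar sketch. It assumes that Adam-atan2 replaces Adam's usual ratio m / (sqrt(v) + epsilon) with atan2(m, sqrt(v)). The function name adam_atan2_step, its defaults, and the omission of any extra scaling constants are assumptions made for illustration, not the authors' implementation.

```python
import math


def adam_atan2_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One scalar Adam-style step with atan2 in place of m / (sqrt(v) + eps).

    Illustrative sketch only: names and defaults are hypothetical.
    `step` is the 1-based step count used for bias correction.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Standard Adam bias correction.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # atan2 is well defined at (0, 0), so no epsilon is needed, and it is
    # unchanged when both arguments are multiplied by the same positive factor.
    update = math.atan2(m_hat, math.sqrt(v_hat))
    return param - lr * update, m, v
```

In this sketch, rescaling every gradient by a positive constant leaves the update unchanged, and a zero gradient with zero moments produces a zero update rather than a division by zero.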

Abstract

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just…
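
A learning rate scaling prescription of the kind the abstract mentions can be sketched as a power law in width, with a separate exponent per layer. The sketch below assumes the form base_lr * (width / base_width) ** (-exponent). The layer names and exponent values are hypothetical placeholders chosen for illustration, not the prescription proposed in the paper.

```python
def scaled_learning_rate(base_lr, width, base_width, exponent):
    """Return base_lr * (width / base_width) ** (-exponent)."""
    return base_lr * (width / base_width) ** (-exponent)


# Hypothetical per-layer exponents, for illustration only.
EXAMPLE_EXPONENTS = {"embedding": 0.0, "hidden": 1.0, "readout": 0.5}


def per_layer_learning_rates(base_lr, width, base_width, exponents):
    """Map each layer name to its width-scaled learning rate."""
    return {
        name: scaled_learning_rate(base_lr, width, base_width, exponent)
        for name, exponent in exponents.items()
    }


# Example: transfer a learning rate tuned at width 1024 to width 4096.
rates = per_layer_learning_rates(1e-2, 4096, 1024, EXAMPLE_EXPONENTS)
```

With these placeholder exponents, the embedding rate stays at the base value while the hidden and readout rates shrink as width grows. Choosing the exponents is the substance of a scaling prescription, so the values above should not be read as recommendations.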

Code & Models

Repositories

clankur/muGPT
jax

Taxonomy

Topics: Advanced Multi-Objective Optimization Algorithms

Methods: Sparse Evolutionary Training · Adam