Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
Blake Bordelon, Francesco Mori

TL;DR
This paper develops a theory of optimal learning rate schedules for a power-law random feature model trained with SGD, identifying distinct easy and hard phases and deriving schedules that outperform constant and power-law benchmarks.
Contribution
It derives optimal LR schedules analytically for a solvable model, covering the easy-phase and hard-phase regimes, joint optimization with batch size, and extensions to momentum parameters.
Findings
Optimal schedules differ by phase: polynomial decay in the easy phase, and a shape resembling warmup-stable-decay in the hard phase.
Joint optimization of LR and batch size improves training efficiency.
The derived schedules outperform constant and power-law benchmarks.
Abstract
Setting the learning rate (LR) for a deep learning model is a critical part of successful training. LRs are often chosen empirically by trial and error. In this work, we explore a solvable model of optimal LR schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule as a function of the current iterate and the training horizon. This schedule is computed both as a numerical optimization problem and analytically using optimal control theory. Our analysis reveals two regimes, which we term the easy phase and the hard phase. In the easy phase, the optimal schedule is a polynomial decay whose parameters depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant…
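To make the two schedule shapes named in the abstract concrete, here is a minimal Python sketch. It is illustrative only: the functional forms, the function names, and every default value (peak_lr, power, warmup_frac, decay_frac) are assumptions chosen for the example, not the paper's derived schedules or exponents.

```python
def polynomial_decay(t, T, peak_lr=0.1, power=1.0):
    """Polynomial decay, assumed form: peak_lr * (1 - t/T)**power.

    t is the current iterate and T the training horizon.
    peak_lr and power are hypothetical placeholder values.
    """
    return peak_lr * (1.0 - t / T) ** power


def warmup_stable_decay(t, T, peak_lr=0.1, warmup_frac=0.05, decay_frac=0.1):
    """Warmup-stable-decay shape with linear ramps (an assumed, generic form).

    The LR rises linearly during warmup, stays at peak_lr, then falls
    linearly to zero over the final decay_frac of training.
    """
    warmup_steps = round(warmup_frac * T)
    decay_steps = max(1, round(decay_frac * T))  # avoid division by zero
    decay_start = T - decay_steps
    if t < warmup_steps:
        return peak_lr * (t + 1) / warmup_steps  # linear warmup
    if t < decay_start:
        return peak_lr                           # stable plateau
    return peak_lr * (T - t) / decay_steps       # linear anneal to zero


if __name__ == "__main__":
    T = 100
    for t in (0, 25, 50, 95, 100):
        print(t, polynomial_decay(t, T), warmup_stable_decay(t, T))
```

Either function can be called once per step to set the optimizer's learning rate. The sketch only contrasts a schedule that decays throughout training with one that holds a plateau and anneals near the end.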
