Taking the Road Less Scheduled with Adaptive Polyak Steps
Dimitris Oikonomou, Matthew Buchholz, Yuen-Man Pun, Robert M. Gower, Nicolas Loizou

TL;DR
This paper introduces adaptive Polyak step sizes for Schedule-Free SGD and Adam, enabling convergence without prior knowledge of problem constants and improving robustness and performance in language modeling tasks.
Contribution
The paper derives Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates, removing the need to hand-tune a base learning rate.
Findings
Achieves an $O(1/\sqrt{t})$ last-iterate convergence rate for convex Lipschitz objectives.
Matches or surpasses tuned Schedule-Free baselines in language modeling tasks.
Offers greater robustness to hyperparameter choices in experiments.
Abstract
Schedule-Free SGD, proposed in [Defazio et al., 2024], achieves optimal convergence rates without requiring the training horizon in advance, by replacing learning rate schedules with a principled form of iterate averaging. However, the method still requires tuning a base learning rate whose optimal value depends on unknown problem constants. In this work, we continue down this road by deriving Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates alone. We first propose an oracle variant that uses per-sample optimal function values and prove an anytime last-iterate rate for convex Lipschitz objectives. We then remove the oracle requirement with a safeguarded variant that replaces the unknown optimal values with any available lower bound, achieving the same rate up to a…
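To make the idea in the abstract concrete, here is a minimal sketch that combines the Schedule-Free SGD recursion of Defazio et al. (interpolate, take a gradient step on the base iterate, then average uniformly) with the classical Polyak step size, (sampled loss minus a lower bound) divided by the squared gradient norm. This is an assumption-laden illustration and not the step size derived in the paper: the function names, the `max_step` cap, the `beta` default, and the toy least-squares problem are all hypothetical choices made for the example.

```python
import numpy as np


def schedule_free_polyak_sgd(sample_loss_and_grad, x0, n_steps,
                             beta=0.9, lower_bound=0.0, max_step=1.0, seed=0):
    """Illustrative sketch: Schedule-Free SGD with a classical Polyak step.

    NOT the paper's formula. `lower_bound` stands in for any known lower
    bound on the per-sample losses; `max_step` is a hypothetical safeguard.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)  # averaged iterate (returned at the end)
    z = x.copy()                   # base SGD iterate
    for t in range(1, n_steps + 1):
        # Gradient is evaluated at an interpolation of z and x.
        y = (1.0 - beta) * z + beta * x
        loss, grad = sample_loss_and_grad(y, rng)
        grad_sq = float(grad @ grad)
        # Classical Polyak step, clipped below at 0 and above at max_step.
        gap = max(loss - lower_bound, 0.0)
        step = min(max_step, gap / grad_sq) if grad_sq > 0.0 else 0.0
        z = z - step * grad
        # Uniform averaging of the z sequence.
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z
    return x


def make_least_squares(n=20, d=5, seed=1):
    """Toy consistent least-squares problem, so each sample loss has minimum 0."""
    data_rng = np.random.default_rng(seed)
    A = data_rng.standard_normal((n, d))
    b = A @ data_rng.standard_normal(d)

    def sample_loss_and_grad(y, rng):
        i = rng.integers(n)          # draw one data point uniformly
        r = A[i] @ y - b[i]          # residual on that data point
        return 0.5 * r * r, r * A[i]

    def mean_loss(x):
        return 0.5 * float(np.mean((A @ x - b) ** 2))

    return sample_loss_and_grad, mean_loss, d


sample_fn, mean_loss, d = make_least_squares()
x_hat = schedule_free_polyak_sgd(sample_fn, np.zeros(d), n_steps=2000)
print("initial loss:", mean_loss(np.zeros(d)))
print("final loss:  ", mean_loss(x_hat))
```

On this toy problem the per-sample optimal value is exactly 0, so `lower_bound=0.0` plays the role of the oracle value. When only a loose lower bound is available, the computed steps become larger, which is why the sketch caps them with `max_step`.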
