TL;DR
PowerStep is a memory-efficient adaptive optimizer inspired by $\ell_p$-norm steepest descent. It matches Adam's convergence speed while halving optimizer memory, which makes it suitable for large-scale neural network training.
Contribution
Introduces PowerStep, an optimizer that reduces memory overhead by not storing second-moment statistics. The paper proves convergence at the optimal rate for non-convex stochastic optimization and evaluates the method on Transformer models from 124M to 235B parameters.
Findings
PowerStep matches Adam's convergence speed.
Halves optimizer memory compared to Adam.
Remains stable with int8 quantization and large models.
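The memory claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes optimizer state is held in fp32 (4 bytes per value), that Adam keeps two buffers per parameter (first and second moment), and that a momentum-only optimizer keeps one. The function name `optimizer_state_bytes` is hypothetical and not taken from the paper.

```python
def optimizer_state_bytes(num_params, buffers_per_param, bytes_per_value=4):
    """Bytes of optimizer state, assuming every buffer has one value per parameter."""
    return num_params * buffers_per_param * bytes_per_value


# Smallest model size mentioned in the abstract (124M parameters).
num_params = 124_000_000

# Assumption: Adam stores first- and second-moment buffers in fp32.
adam_bytes = optimizer_state_bytes(num_params, buffers_per_param=2)

# Assumption: a momentum-only optimizer stores a single fp32 buffer.
single_buffer_bytes = optimizer_state_bytes(num_params, buffers_per_param=1)

print(f"Adam state:          {adam_bytes / 1e9:.2f} GB")
print(f"Single-buffer state: {single_buffer_bytes / 1e9:.2f} GB")
```

Under these assumptions the state shrinks from roughly 0.99 GB to 0.50 GB for the 124M-parameter model. The ratio does not depend on model size, so the same factor of two applies to the larger models.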
Abstract
Adaptive optimizers, most notably Adam, have become the default for training large-scale neural networks such as Transformers. These methods maintain running estimates of the first and second moments of the gradient, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive…
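The abstract does not spell out the nonlinear transform, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It uses the textbook steepest-descent direction for the $\ell_p$ norm, $d_i = \operatorname{sign}(m_i)\,|m_i|^{q-1} / \|m\|_q^{q-1}$ with $1/p + 1/q = 1$, and applies it to a momentum buffer. The names `power_transform` and `sketch_step`, the normalization, and the default hyperparameters are all assumptions made for illustration.

```python
import numpy as np


def power_transform(m, p, eps=1e-12):
    """Unit-l_p-norm steepest-descent direction for buffer m (textbook form).

    Assumption: this is a generic l_p construction, not necessarily the
    transform used by PowerStep.
    """
    q = p / (p - 1.0)                      # Hoelder conjugate of p
    mag = np.abs(m)
    direction = np.sign(m) * mag ** (q - 1.0)
    norm_q = np.sum(mag ** q) ** (1.0 / q)
    return direction / (norm_q ** (q - 1.0) + eps)


def sketch_step(param, grad, m, lr=1e-3, beta=0.9, p=4.0):
    """One hypothetical update: momentum, then an elementwise power transform.

    Only the momentum buffer m is carried between steps; no second-moment
    buffer is stored.
    """
    m = beta * m + (1.0 - beta) * grad
    param = param - lr * power_transform(m, p)
    return param, m


# Usage: a single step on a toy parameter vector.
param = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
m = np.zeros(3)
param, m = sketch_step(param, grad, m)
```

With `p = 2` the transform reduces to the normalized buffer `m / ||m||_2`, and as `p` grows the exponent `q - 1` approaches zero, so the direction approaches `sign(m)`. Intermediate values of `p` rescale each coordinate by a power of its own magnitude, which is one way an elementwise transform can act coordinate-wise without a second-moment estimate.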
