Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
Aditya Ranganath

TL;DR
This survey reviews recent advances in optimizer design for large language models, emphasizing efficiency, stability, and comprehensive benchmarking to guide future research.
Contribution
It categorizes optimizer types for LLMs, discusses benchmarking methodologies, and advocates for rigorous, scale-aware comparisons in optimizer research.
Findings
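Organizes the literature into categories including classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based and discovered, low-rank and projection-based, and matrix-based optimizers.
Highlights the importance of comprehensive benchmarking that covers hyperparameter choices and efficiency.
Argues for a shift toward rigorous, scale-aware evaluation of optimizers.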
Abstract
Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale language-model pretraining and fine-tuning, recent work has revisited nearly every component of the optimization stack: adaptive moment estimation, decoupled weight decay, memory footprint, curvature approximation, sign-based updates, large-batch stability, low-rank gradient structure, and matrix-wise orthogonalized updates. This survey reviews optimizer design for large language models through a systems-and-optimization lens. We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and…
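To make two of the components named in the abstract concrete (adaptive moment estimation and decoupled weight decay), here is a minimal sketch of a single AdamW-style update in plain Python. It is an illustration under stated assumptions, not code from the survey: the function name `adamw_step`, the flat-list parameter layout, and the default hyperparameters are all hypothetical choices made for readability.

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update on flat lists of floats (illustrative sketch).

    params, grads, m, v are equal-length lists; t is the 1-based step count.
    Returns new (params, m, v) lists without mutating the inputs.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Adaptive moment estimation: running averages of g and g^2.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Decoupled weight decay: shrink the weight directly, outside the
        # adaptive rescaling, instead of adding an L2 term to the gradient.
        p = p - lr * weight_decay * p
        # Adaptive step, scaled per coordinate by the second-moment estimate.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


if __name__ == "__main__":
    # Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
    x, m, v = [1.0], [0.0], [0.0]
    for step in range(1, 6):
        x, m, v = adamw_step(x, [2.0 * x[0]], m, v, step, lr=0.1)
    print(x)
```

The sketch also shows where the memory-footprint concern comes from: the optimizer carries two extra values (`m` and `v`) for every parameter, so its state is twice the size of the model's weights.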
