APOLLO: SGD-like Memory, AdamW-level Performance
Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee

TL;DR
APOLLO introduces a memory-efficient optimizer for large language model training that matches or surpasses AdamW's performance while significantly reducing memory usage, enabling faster and more scalable training on various hardware.
Contribution
The paper proposes APOLLO, a novel optimizer that approximates AdamW's learning rate adaptation with low-rank updates, greatly reducing memory overhead while maintaining high training performance.
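To make the idea concrete, here is a minimal NumPy sketch of one way such a step could look. It is an illustration under stated assumptions, not the authors' implementation: the function name `apollo_like_step`, the fixed random projection `proj`, the per-column norm-ratio scaling, and all hyperparameter defaults are assumptions chosen for this example, and weight decay is omitted. The Adam-style moments are kept only in a small projected space, and they are used to derive one scaling factor per column of the full gradient.

```python
import numpy as np

def apollo_like_step(grad, state, proj, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the weight delta for one step (illustrative sketch only).

    grad:  (m, n) full gradient of a weight matrix
    proj:  (r, m) fixed random projection with r << m (assumed design)
    state: dict holding low-rank moments "m" and "v" of shape (r, n) and step "t"
    """
    low = proj @ grad                                    # projected gradient, (r, n)
    state["t"] += 1
    # Adam-style moment estimates, stored only in the low-rank space
    state["m"] = beta1 * state["m"] + (1 - beta1) * low
    state["v"] = beta2 * state["v"] + (1 - beta2) * low**2
    m_hat = state["m"] / (1 - beta1 ** state["t"])       # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    adapted = m_hat / (np.sqrt(v_hat) + eps)             # adapted low-rank gradient
    # Assumed structured rule: one scaling factor per column, taken as the
    # ratio of adapted to raw column norms in the projected space
    scale = np.linalg.norm(adapted, axis=0) / (np.linalg.norm(low, axis=0) + eps)
    return -lr * grad * scale                            # scale broadcasts over rows

# Usage with hypothetical sizes
rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
proj = rng.standard_normal((r, m)) / np.sqrt(r)
state = {"m": np.zeros((r, n)), "v": np.zeros((r, n)), "t": 0}
grad = rng.standard_normal((m, n))
delta = apollo_like_step(grad, state, proj)
print(delta.shape, state["m"].shape)
```

In this sketch the optimizer state has shape (r, n) rather than (m, n), while the update itself still uses the full-rank gradient, rescaled column by column.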
Findings
APOLLO achieves comparable or better performance than AdamW.
APOLLO reduces optimizer memory by nearly eliminating AdamW's optimizer states.
APOLLO supports larger batch sizes and training on low-end GPUs with minimal memory.
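The memory claim can be illustrated with back-of-the-envelope arithmetic. AdamW keeps two state tensors (first and second moments) for every parameter. The sketch below compares that with a scheme that keeps both moments in a rank-by-n projected space. The layer shape, the rank, and the helper names are hypothetical, and storage for the projection matrix itself is ignored (assuming it can be regenerated, for example from a seed). These are not figures reported by the paper.

```python
def adamw_state_elems(m, n):
    # AdamW stores a first and a second moment for each of the m * n parameters
    return 2 * m * n

def low_rank_state_elems(m, n, rank):
    # Assumption: both moments live in a (rank, n) projected space
    return 2 * rank * n

m, n, rank = 4096, 4096, 256          # hypothetical layer shape and rank
full = adamw_state_elems(m, n)
low = low_rank_state_elems(m, n, rank)
print(full, low, full / low)          # 33554432 2097152 16.0
```

Under these assumptions the optimizer state for this one layer shrinks by a factor of m / rank, here 16.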
Abstract
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an…
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: AdamW · Adaptive Parameter-wise Diagonal Quasi-Newton Method
