APOLLO: SGD-like Memory, AdamW-level Performance

Hanqing Zhu; Zhenyu Zhang; Wenyan Cong; Xi Liu; Sem Park; Vikas Chandra; Bo Long; David Z. Pan; Zhangyang Wang; Jinwon Lee

arXiv:2412.05270·cs.LG·February 18, 2025

TL;DR

APOLLO is a memory-efficient optimizer for large language model training that matches or surpasses AdamW's performance while using far less optimizer memory, which allows larger batch sizes and training on lower-end GPUs.

Contribution

The paper proposes APOLLO, a novel optimizer that approximates AdamW's learning rate adaptation with low-rank updates, greatly reducing memory overhead while maintaining high training performance.

Findings

01. APOLLO achieves performance comparable to or better than AdamW.

02. APOLLO reduces optimizer memory by nearly eliminating AdamW's optimizer states.

03. APOLLO supports larger batch sizes and training on low-end GPUs with minimal memory.
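
To give a feel for the memory claim, the sketch below counts optimizer-state elements for a single weight matrix. It relies on one known fact (AdamW keeps a first and a second moment per parameter) and on assumptions that are not from the paper: the hypothetical helper `optimizer_state_elements`, the 4096 x 4096 layer shape, the example ranks, and the idea that a low-rank optimizer keeps both moments in an r x n auxiliary space.

```python
from typing import Optional


def optimizer_state_elements(m: int, n: int, rank: Optional[int] = None) -> int:
    """Elements held in the two moment buffers for one m x n weight matrix.

    rank=None: AdamW, with first and second moments of shape m x n each.
    rank=r:    assumed low-rank setup, with both moments of shape r x n.
    """
    rows = m if rank is None else rank
    return 2 * rows * n


# Illustrative layer shape (an assumption, not taken from the paper).
m, n = 4096, 4096
full = optimizer_state_elements(m, n)
for r in (256, 1):
    low = optimizer_state_elements(m, n, rank=r)
    print(f"rank={r}: {low} elements, {full // low}x fewer than AdamW's {full}")
```

Under these assumptions the state shrinks by a factor of m / r, so the saving grows as the rank drops. Any storage needed for a projection matrix is ignored here.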

Abstract

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an…
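
The coarsening idea in the abstract can be illustrated with a small sketch. It computes Adam's element-wise direction for a toy gradient matrix, then collapses it into one scale factor per channel and applies that factor to the raw gradient. Several things here are assumptions for illustration rather than the authors' method: treating a column as a channel, using a norm ratio as the scale, and the helper names `adam_direction`, `column_norms`, `channelwise_scales`, and `coarsened_update`. The moments are kept at full size, so this sketch shows only the structured scaling, not the memory saving.

```python
import math


def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction for a matrix given as a list of rows.

    Returns (direction, new_m, new_v). Weight decay and the learning rate
    are left out to keep the focus on the per-element scaling.
    """
    c1, c2 = 1 - beta1 ** step, 1 - beta2 ** step
    new_m = [[beta1 * mi + (1 - beta1) * g for mi, g in zip(mr, gr)]
             for mr, gr in zip(m, grad)]
    new_v = [[beta2 * vi + (1 - beta2) * g * g for vi, g in zip(vr, gr)]
             for vr, gr in zip(v, grad)]
    direction = [[(mi / c1) / (math.sqrt(vi / c2) + eps) for mi, vi in zip(mr, vr)]
                 for mr, vr in zip(new_m, new_v)]
    return direction, new_m, new_v


def column_norms(mat):
    """Euclidean norm of each column."""
    return [math.sqrt(sum(row[j] ** 2 for row in mat)) for j in range(len(mat[0]))]


def channelwise_scales(grad, direction, eps=1e-12):
    """One factor per column: ||direction[:, j]|| / ||grad[:, j]|| (assumed rule)."""
    return [d / (g + eps) for d, g in zip(column_norms(direction), column_norms(grad))]


def coarsened_update(grad, scales):
    """Scale every column of the raw gradient by its single channel factor."""
    return [[g * s for g, s in zip(row, scales)] for row in grad]


# Toy 2 x 2 gradient with zero-initialised moments (first optimizer step).
grad = [[0.5, -2.0], [1.5, 4.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
direction, m1, v1 = adam_direction(grad, zeros, zeros, step=1)
scales = channelwise_scales(grad, direction)
coarse = coarsened_update(grad, scales)
print("element-wise direction:", direction)
print("channel-wise scales:   ", scales)
print("coarsened update:      ", coarse)
```

By construction, each column of the coarsened update has the same norm as the corresponding column of the element-wise direction, while its entries keep the relative proportions of the raw gradient.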

Code & Models

Repositories

zhuhanqing/APOLLO (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: AdamW