Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices

Pengxiang Zhao; Ping Li; Yingjie Gu; Yi Zheng; Stephan Ludger Kölker; Zhefeng Wang; Xiaoming Yuan

arXiv:2403.14958 · cs.LG · March 25, 2024 · 1 citation

Open Access

TL;DR

Adapprox introduces a memory-efficient Adam optimizer variant using randomized low-rank matrix approximation with adaptive rank selection, improving memory savings, convergence speed, and downstream task performance in large-scale deep learning.

Contribution

The paper presents an adaptive randomized low-rank approximation of Adam's second moment that balances accuracy against memory use, with an optional cosine similarity guidance strategy to improve stability and speed up convergence.

Findings

1. Achieves 34.5% to 49.9% memory savings over AdamW on GPT-2 117M, and 33.8% to 49.9% on GPT-2 345M
2. Enhances convergence speed compared to AdamW
3. Improves downstream task performance
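
The upper end of the reported savings sits just under 50%, which is what a simple element count would suggest when the first moment is kept in full and only the second moment is compressed. The sketch below is that count for a single weight matrix. The function names, the 768 x 3072 layer shape, and the rank of 8 are illustrative assumptions, and this is not the accounting used in the paper.

```python
def optimizer_state_elements(m, n, rank=None, keep_first_moment=True):
    """Count optimizer-state elements for one m x n weight matrix.

    rank=None stores the full second moment (as Adam/AdamW do);
    otherwise the second moment is held as two factors of shapes
    (m, rank) and (rank, n). Hypothetical helper for illustration.
    """
    first = m * n if keep_first_moment else 0
    second = m * n if rank is None else rank * (m + n)
    return first + second


def saving_vs_adam(m, n, rank, keep_first_moment=True):
    """Fraction of state saved relative to full Adam (both moments dense)."""
    baseline = optimizer_state_elements(m, n)
    compressed = optimizer_state_elements(m, n, rank, keep_first_moment)
    return 1 - compressed / baseline


# Illustrative shape only: a 768 x 3072 matrix with a rank-8 second moment.
print(saving_vs_adam(768, 3072, rank=8))
print(saving_vs_adam(768, 3072, rank=8, keep_first_moment=False))
```

With the first moment kept, the saving from this count can approach but never reach 50%, because the dense first moment alone is half of Adam's state. Dropping the first moment removes that ceiling.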

Abstract

As deep learning models exponentially increase in size, optimizers such as Adam encounter significant memory consumption challenges due to the storage of first and second moment data. Current memory-efficient methods like Adafactor and CAME often compromise accuracy with their matrix factorization techniques. Addressing this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism, finely balancing accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% and 33.8% to 49.9% memory savings for the 117M and 345M models, respectively, with the first…
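
To make the idea of a randomized low-rank approximation concrete, here is a minimal NumPy sketch using a generic randomized range finder with a few power iterations. It is not the paper's algorithm: the function names, the oversampling and power-iteration defaults, and in particular the rank-doubling rule in `adaptive_low_rank` are assumptions chosen for illustration, since the abstract does not specify how the rank is selected.

```python
import numpy as np


def randomized_low_rank(V, rank, n_oversample=5, n_power_iter=2, seed=0):
    """Return (L, R) with V ~= L @ R, L of shape (m, rank), R of shape (rank, n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    k = min(rank + n_oversample, m, n)
    # Sample the column space of V with a Gaussian test matrix.
    Q, _ = np.linalg.qr(V @ rng.standard_normal((n, k)))
    # Power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(V.T @ Q)
        Q, _ = np.linalg.qr(V @ Q)
    # Project onto the subspace, then truncate with a small exact SVD.
    U_b, s, Vt = np.linalg.svd(Q.T @ V, full_matrices=False)
    L = (Q @ U_b[:, :rank]) * s[:rank]
    R = Vt[:rank]
    return L, R


def adaptive_low_rank(V, tol=1e-2, start_rank=1, max_rank=None):
    """Double the rank until the relative Frobenius error is <= tol.

    Assumed selection rule for illustration only, not the paper's.
    """
    max_rank = max_rank or min(V.shape)
    norm_V = np.linalg.norm(V)
    rank = start_rank
    while True:
        L, R = randomized_low_rank(V, rank)
        err = np.linalg.norm(V - L @ R) / norm_V
        if err <= tol or rank >= max_rank:
            return L, R, rank, err
        rank = min(2 * rank, max_rank)


# Synthetic nonnegative matrix of exact rank 3, standing in for a second moment.
rng = np.random.default_rng(1)
V = rng.random((64, 3)) @ rng.random((3, 48))
L, R, rank, err = adaptive_low_rank(V)
print(rank, err, L.size + R.size, V.size)
```

The two factors hold `rank * (m + n)` numbers instead of `m * n`, which is where the memory saving comes from. A larger rank lowers the approximation error at the cost of more storage, and that is the trade-off an adaptive rank rule has to manage.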

Taxonomy

Topics: Metaheuristic Optimization Algorithms Research · Statistical Mechanics and Entropy · Face and Expression Recognition

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Attention Dropout · Residual Connection · Cosine Annealing · Multi-Head Attention · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Dense Connections