Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices
Pengxiang Zhao, Ping Li, Yingjie Gu, Yi Zheng, Stephan Ludger Kölker, Zhefeng Wang, Xiaoming Yuan

TL;DR
Adapprox introduces a memory-efficient Adam optimizer variant using randomized low-rank matrix approximation with adaptive rank selection, improving memory savings, convergence speed, and downstream task performance in large-scale deep learning.
Contribution
The paper presents an adaptive low-rank approximation method for Adam's second moment that balances accuracy against memory use, with an optional cosine-similarity guidance strategy to improve stability and speed up convergence.
Findings
Achieves 34.5% to 49.9% memory savings over AdamW on GPT-2 117M, and 33.8% to 49.9% on GPT-2 345M
Enhances convergence speed compared to AdamW
Improves downstream task performance
Abstract
As deep learning models exponentially increase in size, optimizers such as Adam encounter significant memory consumption challenges due to the storage of first and second moment data. Current memory-efficient methods like Adafactor and CAME often compromise accuracy with their matrix factorization techniques. Addressing this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism, finely balancing accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% and 33.8% to 49.9% memory savings for the 117M and 345M models, respectively, with the first…
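The abstract describes approximating Adam's second-moment matrix with a randomized low-rank factorization whose rank is chosen adaptively. The following NumPy sketch illustrates that general idea only; it is not the authors' algorithm. It uses a standard randomized range finder with power iterations, and the names `tol`, `oversample`, `n_iter`, and the rank-doubling schedule are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def randomized_low_rank(V, rank, oversample=5, n_iter=2, seed=0):
    """Return (left, right) with V ~= left @ right, each of width `rank`.

    Generic randomized SVD; the hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    k = min(rank + oversample, m, n)

    # Sample the range of V with a Gaussian test matrix.
    Y = V @ rng.standard_normal((n, k))
    # Power iterations sharpen the estimate of the dominant subspace.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)
        Y = V @ (V.T @ Q)
    Q, _ = np.linalg.qr(Y)

    # Project V onto the subspace and truncate to the requested rank.
    B = Q.T @ V
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    r = min(rank, k)
    left = (Q @ U[:, :r]) * s[:r]
    right = Vt[:r]
    return left, right


def adaptive_rank_approx(V, tol=1e-2, start_rank=1, max_rank=None):
    """Double the rank until the relative Frobenius error drops below `tol`.

    The doubling schedule is a hypothetical stand-in for adaptive rank selection.
    """
    max_rank = max_rank or min(V.shape)
    norm_V = np.linalg.norm(V)
    rank = start_rank
    while True:
        left, right = randomized_low_rank(V, rank)
        err = np.linalg.norm(V - left @ right) / norm_V
        if err <= tol or rank >= max_rank:
            return left, right, rank, err
        rank = min(2 * rank, max_rank)
```

For an m-by-n second-moment matrix, storing the two factors takes r(m + n) numbers instead of mn, which is where the memory saving in this kind of scheme comes from; a smaller `tol` trades that saving for accuracy.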
Taxonomy
Topics: Metaheuristic Optimization Algorithms Research · Statistical Mechanics and Entropy · Face and Expression Recognition
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Attention Dropout · Residual Connection · Cosine Annealing · Multi-Head Attention · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Dense Connections
