MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtarik, Dan Alistarh

TL;DR
MicroAdam is a memory-efficient variant of Adam that compresses gradient information with error feedback, maintaining convergence guarantees and practical performance on large-scale models like BERT and LLaMA.
Contribution
We introduce MicroAdam, a novel optimizer that reduces memory overhead through gradient compression with error feedback, while preserving convergence guarantees.
Findings
MicroAdam achieves significant memory savings on large models.
It maintains convergence comparable to Adam and AMSGrad.
It runs efficiently on GPUs for billion-scale models.
Abstract
We propose a new variant of the Adam optimizer called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instance of the classical *error feedback* mechanism from distributed optimization in which *the error correction information is itself compressed* to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive with those of AMSGrad, while providing good practical performance. Specifically, we show that MicroAdam can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical…
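The mechanism described in the abstract (compress the gradient, carry the compression error forward, and compress that error as well) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the compressors, so Top-K sparsification and uniform quantization are assumed here as illustrative stand-ins, and the function names `top_k`, `quantize`, and `compress_with_error_feedback` are hypothetical.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest.

    Assumption: Top-K is an illustrative compressor, not one named in the abstract.
    """
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in idx:
        out[i] = vec[i]
    return out


def quantize(vec, levels=16):
    """Snap each entry to one of `levels` evenly spaced values in [min, max].

    Assumption: uniform quantization stands in for "compressing the error".
    """
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / (levels - 1)
    return [lo + round((x - lo) / step) * step for x in vec]


def compress_with_error_feedback(grad, error, k, levels=16):
    """One step of gradient compression with a compressed error buffer.

    Returns the sparse gradient to pass to the optimizer state and the
    (quantized) error to add back on the next step.
    """
    # Error feedback: re-inject what earlier compression steps dropped.
    corrected = [g + e for g, e in zip(grad, error)]
    # Compress the corrected gradient before it reaches the optimizer state.
    sparse = top_k(corrected, k)
    # Whatever was dropped becomes the new error, stored in compressed form.
    residual = [c - s for c, s in zip(corrected, sparse)]
    new_error = quantize(residual, levels)
    return sparse, new_error


if __name__ == "__main__":
    grad = [0.5, -2.0, 0.1, 1.0]
    error = [0.0] * len(grad)
    sparse, error = compress_with_error_feedback(grad, error, k=1)
    print("sparse gradient:", sparse)
    print("stored error:", error)
```

In this sketch, entries dropped by the compressor at one step are not lost: they sit in the error buffer and are added to the next gradient, so a coordinate that is repeatedly ignored eventually becomes large enough to be selected.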
Taxonomy
Topics: Metaheuristic Optimization Algorithms Research
Methods: AMSGrad · Adam
