LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

TL;DR
LDAdam is a memory-efficient adaptive optimizer that uses low-dimensional gradient statistics for training large models, enabling effective fine-tuning and pre-training with reduced memory usage.
Contribution
The paper introduces LDAdam, a novel optimizer that performs adaptive steps in low-dimensional subspaces while maintaining exploration of the full parameter space, with proven convergence.
Findings
Keeps the optimizer's memory footprint to a fraction of the model size.
Achieves accurate fine-tuning and pre-training of language models.
Provides convergence guarantees under standard assumptions.
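To make the memory claim concrete, here is a rough element count for a single weight matrix. It is only a sketch under stated assumptions: the function names, the 4096 x 4096 layer, and the rank of 16 are hypothetical, and the low-rank layout (two rank x cols moment buffers plus one rows x rank projection basis) is an assumed layout that ignores any error-feedback buffer.

```python
def adam_state_elems(rows, cols):
    """Two full-size moment buffers, as in standard Adam."""
    return 2 * rows * cols


def low_rank_state_elems(rows, cols, rank):
    """Assumed layout: two rank x cols moment buffers plus a rows x rank basis."""
    return 2 * rank * cols + rows * rank


# Hypothetical layer size and rank, chosen only for illustration.
full = adam_state_elems(4096, 4096)           # 33_554_432 elements
low = low_rank_state_elems(4096, 4096, 16)    # 196_608 elements
ratio = low / full                            # 0.005859375, i.e. about 0.6%
```

Under these assumptions the optimizer state shrinks with the ratio of the rank to the layer dimension, which is why a small rank gives a large saving.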
Abstract
We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at…
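The ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes an SVD-based projection, a hypothetical state dictionary, and a simple transfer rule that treats old subspace coordinates as uncorrelated. It models error feedback for gradient compression only, not the generalized mechanism that also covers optimizer state compression.

```python
import numpy as np


def top_r_basis(mat, rank):
    """Orthonormal basis (rows x rank) spanning the leading left singular subspace."""
    u, _, _ = np.linalg.svd(mat, full_matrices=False)
    return u[:, :rank]


def init_state(shape, rank):
    """Hypothetical state layout: low-rank moments plus an explicit error buffer."""
    rows, cols = shape
    return {
        "step": 0,
        "basis": None,                    # previous projection basis (rows x rank)
        "m": np.zeros((rank, cols)),      # first moment, stored in the subspace
        "v": np.zeros((rank, cols)),      # second moment, stored in the subspace
        "error": np.zeros((rows, cols)),  # what earlier projections discarded
    }


def low_rank_adam_step(weight, grad, state, rank,
                       lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    state["step"] += 1

    # Error feedback: re-inject the part of past gradients lost to projection.
    acc = grad + state["error"]
    basis = top_r_basis(acc, rank)
    low = basis.T @ acc                   # projected gradient (rank x cols)
    state["error"] = acc - basis @ low    # residual outside the new subspace

    # Projection-aware transition: express the old moments in the new basis.
    if state["basis"] is not None:
        rot = basis.T @ state["basis"]    # rank x rank change of coordinates
        state["m"] = rot @ state["m"]
        # Assumption: old coordinates are treated as uncorrelated, so second
        # moments transfer through the squared entries of the rotation.
        state["v"] = (rot ** 2) @ state["v"]
    state["basis"] = basis

    # Standard Adam moment updates, carried out on the projected gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * low
    state["v"] = beta2 * state["v"] + (1 - beta2) * low ** 2
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])

    # Map the adaptive step back to the full parameter space.
    return weight - lr * basis @ (m_hat / (np.sqrt(v_hat) + eps))
```

Calling `low_rank_adam_step` repeatedly with the gradient of a simple quadratic loss shows the intended behavior: the subspace changes from step to step, and the error buffer carries the discarded directions forward until they are selected. The full-size `error` buffer is kept explicit here for readability; how it is stored in practice is not specified by this sketch.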
Peer Reviews
Decision: ICLR 2025 Poster
The paper is overall easy to follow, the idea is interesting, and it seems to work well empirically.
I see that the authors provide some running times and memory metrics in Appendix B, but it seems to me that this part should be more exhaustive in order to prove the proposed algorithm useful. I know this is hard to monitor for algorithms on GPU, but could the authors provide graphs similar to Figure 1, but with running times and memory usage, for Adam, GaLore, and LDAdam (proposed)? In other words, could the authors provide graphs with perplexity as a function of runtime and perplexity as a function
The paper is well written and the contribution is clear. The algorithm is novel and comes with a theoretical convergence guarantee. The experiments indicate that the new approach performs better not only than GaLore but also than vanilla Adam for some parameter settings.
(Please respond to the "Questions" directly) The computational time of the proposed method is not clear, and some discussion of the theoretical results is needed.
1. **Projection-Aware Update Rule**: LDAdam uses a projection-aware update rule to transition between subspaces. This rule allows it to estimate the gradient statistics even after dimensional reduction, adapting efficiently to changes in the subspace basis without losing essential gradient information.
2. **Generalized Error Feedback Mechanism**: To address inaccuracies from low-rank projections, LDAdam introduces a unique error feedback mechanism that accounts for both gradient and optimizer s
Liang et al. [1] provide a convergence analysis for GaLore and related algorithms without relying on a 'stable-rank' assumption. Since LDAdam aligns with this framework, I recommend citing [1] for its convergence insights. Additionally, a comparison of LDAdam’s convergence analysis with [1], specifically on the impact of removing the 'stable-rank' assumption, would help clarify the authors’ theoretical contributions. [1] Memory-Efficient LLM Training with Online Subspace Descent, Kaizhao Liang,
Taxonomy
Topics: Stochastic Gradient Optimization Techniques
