LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

Thomas Robert; Mher Safaryan; Ionut-Vlad Modoranu; Dan Alistarh

arXiv:2410.16103 · cs.LG · March 4, 2025

TL;DR

LDAdam is a memory-efficient adaptive optimizer that uses low-dimensional gradient statistics for training large models, enabling effective fine-tuning and pre-training with reduced memory usage.
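
As a rough illustration of where such a saving can come from, the sketch below counts optimizer-state entries for a single m x n weight matrix. It assumes (this is not stated on this page) that standard Adam keeps two full-size moment buffers, and that a rank-r variant keeps two r x n moment buffers plus an m x r projection basis. The function names and the example sizes are hypothetical, and any additional storage, such as an error-feedback buffer, is not counted.

```python
def adam_state_entries(m: int, n: int) -> int:
    """Entries held by two full-size moment buffers (assumed Adam layout)."""
    return 2 * m * n


def low_rank_state_entries(m: int, n: int, r: int) -> int:
    """Entries held by two r x n moment buffers plus an m x r basis (assumed layout)."""
    return 2 * r * n + m * r


# Hypothetical layer size and rank, chosen only for illustration.
m, n, r = 4096, 4096, 256
full = adam_state_entries(m, n)
low = low_rank_state_entries(m, n, r)
print(full, low, low / full)  # 33554432 3145728 0.09375
```

Under these assumptions the low-rank states take roughly 9% of the entries that the full Adam states take for this one layer; the actual figures for LDAdam depend on details given in the paper, not on this count.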

Contribution

The paper introduces LDAdam, a novel optimizer that performs adaptive steps in low-dimensional subspaces while maintaining exploration of the full parameter space, with proven convergence.

Findings

01. Keeps the optimizer's memory footprint to a fraction of the model size.
02. Achieves accurate fine-tuning and pre-training of language models.
03. Provides convergence guarantees under standard assumptions.

Abstract

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at…
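
The ideas named in the abstract (adaptive steps in a low-dimensional subspace, carrying optimizer states across subspaces, and error feedback) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the subspace is taken from a truncated SVD at every step, the first moment is rotated into the new basis, the second moment is transferred as if old coordinates were uncorrelated, and error feedback covers only gradient compression (the abstract says the paper's mechanism also covers optimizer-state compression). All names and hyperparameter values here are hypothetical.

```python
import numpy as np


def init_state(m, n, r):
    """Low-dimensional Adam moments (r x n) plus a full-size error buffer."""
    return {"m": np.zeros((r, n)), "v": np.zeros((r, n)),
            "err": np.zeros((m, n)), "p": None, "t": 0}


def top_r_basis(mat, r):
    """Orthonormal m x r basis of the top-r left singular subspace of mat."""
    u, _, _ = np.linalg.svd(mat, full_matrices=False)
    return u[:, :r]


def low_rank_adam_step(w, grad, state, r=2, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One illustrative step; the moment-transfer rules below are assumptions."""
    a = grad + state["err"]              # add back what earlier projections dropped
    p = top_r_basis(a, r)                # current subspace
    g_low = p.T @ a                      # projected gradient, shape r x n
    state["err"] = a - p @ g_low         # residual kept for the next step

    if state["p"] is not None:
        rot = p.T @ state["p"]           # r x r change of basis, old -> new
        state["m"] = rot @ state["m"]
        # Assumption: old coordinates are treated as uncorrelated.
        state["v"] = (rot ** 2) @ state["v"]

    state["m"] = b1 * state["m"] + (1 - b1) * g_low
    state["v"] = b2 * state["v"] + (1 - b2) * g_low ** 2
    state["t"] += 1
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    state["p"] = p
    # Adam-style step in the subspace, mapped back to the full space.
    return w - lr * (p @ (m_hat / (np.sqrt(v_hat) + eps)))


# Toy problem: minimise 0.5 * ||W - target||^2, whose gradient is W - target.
rng = np.random.default_rng(0)
target = rng.standard_normal((6, 5))
w0 = np.zeros((6, 5))
state = init_state(6, 5, r=2)
loss_before = 0.5 * np.sum((w0 - target) ** 2)
w1 = low_rank_adam_step(w0, w0 - target, state)
loss_after = 0.5 * np.sum((w1 - target) ** 2)
```

In this sketch only the two r x n moment arrays are low-dimensional; the error buffer is kept at full size for clarity, which is a simplification and says nothing about how the paper stores it.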

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The paper is overall easy to follow, the idea is interesting, and it seems to work well empirically.

Weaknesses

I see that the authors provide some running times and memory metrics in Appendix B, but it seems to me that this part should be more exhaustive in order to show that the proposed algorithm is useful. I know this is hard to monitor for algorithms on GPU, but could the authors provide graphs similar to Figure 1, but with running times and memory usage, for Adam, GaLore, and LDAdam (proposed)? In other words, could the authors provide graphs with perplexity as a function of runtime and perplexity as a function

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper is well written and the contribution is clear. The algorithm is novel and comes with a theoretical convergence guarantee. The experiments indicate that the new approach performs better not only than GaLore but also than vanilla Adam for some parameter settings.

Weaknesses

(Please respond to the "Questions" directly.) The computational time of the proposed method is not clear, and some discussion of the theoretical results is needed.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. **Projection-Aware Update Rule**: LDAdam uses a projection-aware update rule to transition between subspaces. This rule allows it to estimate the gradient statistics even after dimensional reduction, adapting efficiently to changes in the subspace basis without losing essential gradient information.
2. **Generalized Error Feedback Mechanism**: To address inaccuracies from low-rank projections, LDAdam introduces a unique error feedback mechanism that accounts for both gradient and optimizer s

Weaknesses

Liang et al. [1] provide a convergence analysis for GaLore and related algorithms without relying on a 'stable-rank' assumption. Since LDAdam aligns with this framework, I recommend citing [1] for its convergence insights. Additionally, a comparison of LDAdam’s convergence analysis with [1], specifically on the impact of removing the 'stable-rank' assumption, would help clarify the authors’ theoretical contributions. [1] Memory-Efficient LLM Training with Online Subspace Descent, Kaizhao Liang,

Code & Models

Repositories

IST-DASLab/LDAdam
pytorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques