AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Yehonathan Refael; Jonathan Svirsky; Boris Shustin; Wasim Huleihel; Ofir Lindenbaum

arXiv:2410.17881·cs.LG·December 9, 2025

TL;DR

This paper introduces AdaRankGrad, an adaptive low-rank gradient method that reduces memory usage and improves training efficiency for large language models by dynamically adjusting gradient ranks during optimization.

Contribution

We propose a novel adaptive gradient-rank approach that leverages the decreasing rank phenomenon of layer gradients, enabling full-parameter training with lower memory costs.
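The idea of adapting the gradient rank can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simple spectral-energy rule (keep the smallest rank whose singular values hold a chosen fraction of the gradient's squared spectrum), and the function names `select_rank`, `project_gradient`, and the two state-counting helpers are hypothetical. The `energy=0.99` threshold is likewise an assumed value.

```python
import numpy as np


def select_rank(grad, energy=0.99):
    """Smallest rank whose top singular values hold `energy` of the
    squared spectrum. Assumes `grad` is a non-zero 2-D array."""
    s = np.linalg.svd(grad, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))  # guard against floating-point round-off


def project_gradient(grad, rank):
    """Project an m x n gradient onto its top-`rank` left singular vectors.

    Returns the m x rank orthonormal basis and the rank x n compressed
    gradient on which optimizer moments could be kept instead.
    """
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    basis = u[:, :rank]
    return basis, basis.T @ grad


def adam_state_elems(m, n):
    """Elements held by Adam's two moment buffers for an m x n weight."""
    return 2 * m * n


def low_rank_state_elems(m, n, rank):
    """Two rank x n moment buffers plus the m x rank projection basis
    (an illustrative accounting, not the paper's exact bookkeeping)."""
    return 2 * rank * n + m * rank
```

Under these assumptions, as the gradient's effective rank drops during training, `select_rank` returns a smaller value and the moment buffers shrink with it, while the weights themselves remain full-rank.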

Findings

01. Reduces memory requirements compared to existing methods.

02. Improves model performance in pretraining and fine-tuning.

03. Provides convergence analysis and empirical validation.

Abstract

Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements due to the increasing size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which involves introducing a parallel trainable low-rank matrix to the fixed pre-trained weights at each layer. However, these methods often fall short compared to the full-rank weight training approach, as they restrict the parameter search to a low-rank subspace. This limitation can disrupt training dynamics and require a full-rank warm start to mitigate the impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases, and asymptotically approaches rank one. Leveraging…

Taxonomy

Topics: Speech Recognition and Synthesis · Parallel Computing and Optimization Techniques · Machine Learning and Algorithms

Methods: Adam