Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation

Yize Wu; Ke Gao; Ling Li; Yanjun Wu

arXiv:2603.05204·cs.LG·March 6, 2026

Open Access · 3 Reviews

TL;DR

Stable-LoRA introduces a weight-shrinkage strategy to improve the stability of Low-Rank Adaptation in fine-tuning large language models, addressing theoretical limitations and enhancing performance without extra memory costs.

Contribution

It proposes Stable-LoRA, a novel method that dynamically stabilizes feature learning in LoRA by shrinking low-rank matrices during early training, backed by theoretical analysis and empirical validation.
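The idea of shrinking a low-rank matrix during early training can be illustrated with a minimal sketch. The summary above does not specify the schedule, the shrink factor, or which matrix is shrunk, so the names `shrink`, `maybe_shrink`, `shrink_steps`, and `factor`, and the choice to shrink `A` for a fixed number of steps, are all hypothetical. This is not the authors' algorithm.

```python
def shrink(matrix, factor):
    """Scale every entry of a matrix (list of rows) by `factor`."""
    return [[factor * v for v in row] for row in matrix]


def maybe_shrink(A, step, shrink_steps=3, factor=0.5):
    """Shrink A only while `step` is in the early phase (assumed fixed cutoff).

    The fixed step cutoff is a stand-in; the actual stopping rule
    is not given in the summary above.
    """
    if step < shrink_steps:
        return shrink(A, factor)
    return A


# Toy low-rank factor with a non-zero initialization (values are made up).
A = [[0.8, -0.4], [0.2, 1.0]]

for step in range(5):
    # A real training loop would apply an optimizer update to A here.
    A = maybe_shrink(A, step)

# After the loop, A has been scaled by 0.5 three times (steps 0, 1, 2).
```

In this toy setup the shrinkage touches only values already held in memory, which is in line with the claim that no extra memory is needed, although the sketch itself proves nothing about the real method.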

Findings

01

Stable-LoRA effectively eliminates instability in LoRA feature learning.

02

It outperforms baseline methods across various models and tasks.

03

The approach requires no additional memory and only minimal computational overhead.

Abstract

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Language Models. It updates the weight matrix as $W = W_{0} + s B A$, where $W_{0}$ is the original frozen weight, $s$ is a scaling factor, and $A$, $B$ are trainable low-rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self-stabilized) under appropriate hyper-parameters and initializations of $A$ and $B$. However, we also uncover a fundamental limitation: the necessary non-zero initialization of $A$ compromises self-stability, leading to suboptimal performance. To address this challenge, we propose Stable-LoRA, a weight-shrinkage…
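The update rule $W = W_{0} + s B A$ from the abstract can be made concrete with a small pure-Python sketch. The helper names `matmul` and `lora_weight`, the toy shapes, and the numeric values are assumptions chosen for illustration. The zero initialization of `B` follows common LoRA practice and is not taken from the abstract.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W0, A, B, s):
    """Return the effective weight W = W0 + s * (B @ A)."""
    BA = matmul(B, A)
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]


# Toy shapes (assumed): W0 is 2x3, B is 2x1, A is 1x3, so the rank is 1.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0]]   # non-zero initialization of A
B = [[0.0], [0.0]]       # zero initialization of B (common practice, assumed)
s = 2.0

# While B is all zeros, the product B @ A vanishes and W equals W0.
W = lora_weight(W0, A, B, s)

# Once B becomes non-zero, the low-rank term shifts the frozen weight.
W_trained = lora_weight(W0, A, [[1.0], [2.0]], s)
```

Only `A` and `B` would be trained in this setup, while `W0` stays frozen, which is what makes the method parameter-efficient.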

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Please see summary

Weaknesses

Please see summary

Reviewer 02 · Rating 2 · Confidence 3

Strengths

* The paper introduces the $\gamma$-function as a novel analytical construct for understanding LoRA’s stability behavior.
* Theoretical analysis yields an interpretable condition, $A=B=0$, under which stable feature learning can be achieved.
* The proposed progressive-shrink mechanism is a practical way to approximate the ideal condition $A=B=0$ and to address the constraint that, in practice, $A$ and $B$ cannot both be set to 0.

Weaknesses

* Definition 1 lacks rigorous justification. While it aligns with prior empirical observations, it represents only one possible stability condition. The framework built upon it may have limited generalizability, which can only be supported by broader empirical studies.
* The definition of the $\gamma$-function appears mathematically infeasible. Although it seems inspired by logarithmic properties (e.g., $\log(x) + \log(y) = \log(x \times y)$ and $\log(x + y)$ is dominated by $\max(\log(x), \log

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* This paper is solid in its theoretical contribution. The motivation and concepts are well illustrated, making the work easy to follow. The algorithm design is simple yet elegant, and the stability stopping criterion is theoretically justified.
* Empirical results demonstrate both the effectiveness and stability of the proposed method. Moreover, it is computationally efficient, introducing only a minor additional runtime overhead. The approach is also compatible with existing LoRA setups witho

Weaknesses

* The experimental settings and details are somewhat limited and unclear. First, how are the experiments on the QA datasets conducted? Is the model fine-tuned on a mixed training dataset and then evaluated on several benchmarks, or is it fine-tuned on one QA dataset and tested accordingly? If it is the latter case, I would suggest conducting additional experiments on general language understanding and dialogue datasets such as WizardLM to better assess the model’s generalization ability. Moreove

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Sparse and Compressive Sensing Techniques