Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation
Yize Wu, Ke Gao, Ling Li, Yanjun Wu

TL;DR
Stable-LoRA introduces a weight-shrinkage strategy to improve the stability of Low-Rank Adaptation in fine-tuning large language models, addressing theoretical limitations and enhancing performance without extra memory costs.
Contribution
It proposes Stable-LoRA, a novel method that dynamically stabilizes feature learning in LoRA by shrinking low-rank matrices during early training, backed by theoretical analysis and empirical validation.
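The shrinking idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the summary does not give the schedule or stopping rule, so the choice to shrink only `A`, the constant `factor`, and the fixed `shrink_steps` cutoff are all hypothetical stand-ins. The point it shows is that scaling a matrix in place during the first few steps needs no extra stored state.

```python
import numpy as np

def shrink_A(A, step, shrink_steps, factor):
    """Scale A by `factor` during the first `shrink_steps` steps, then leave it alone.

    `shrink_steps` and `factor` are hypothetical hyper-parameters for illustration;
    the paper's actual schedule and stopping criterion are not reproduced here.
    """
    if step < shrink_steps:
        return A * factor
    return A

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 16))          # low-rank factor with a non-zero initialization
init_norm = np.linalg.norm(A)

norms = []
for step in range(10):
    # A real training loop would apply the optimizer update to A and B here.
    A = shrink_A(A, step, shrink_steps=5, factor=0.5)
    norms.append(np.linalg.norm(A))
```

In this toy loop the norm of `A` halves on each of the first five steps and is unchanged afterwards, since no gradient update is applied.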
Findings
Stable-LoRA effectively eliminates instability in LoRA feature learning.
It outperforms baseline methods across various models and tasks.
The approach requires no additional memory and minimal computation overhead.
Abstract
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Language Models. It updates the weight matrix as $W = W_0 + sBA$, where $W_0$ is the original frozen weight, $s$ is a scaling factor and $A$, $B$ are trainable low-rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self-stabilized) under appropriate hyper-parameters and initializations of $A$ and $B$. However, we also uncover a fundamental limitation: the necessary non-zero initialization of $A$ compromises self-stability, leading to suboptimal performance. To address this challenge, we propose Stable-LoRA, a weight-shrinkage…
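As a concrete illustration of the LoRA update described in the abstract, here is a minimal NumPy sketch. The symbol names `W0`, `s`, `A`, `B`, the matrix sizes, and the `1/sqrt(d_in)` scale of the random initialization are assumptions chosen for illustration. The zero initialization of `B` with a non-zero `A` follows the common LoRA convention, not necessarily the exact setup of the paper.

```python
import numpy as np

def lora_init(d_out, d_in, rank, rng):
    """Common LoRA-style init: A random (non-zero), B zero, so B @ A = 0 at the start."""
    A = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(rank, d_in))  # assumed init scale
    B = np.zeros((d_out, rank))
    return A, B

def lora_weight(W0, A, B, s):
    """Effective weight W = W0 + s * B @ A; W0 stays frozen, only A and B are trained."""
    return W0 + s * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, rank, s = 8, 16, 2, 2.0
W0 = rng.normal(size=(d_out, d_in))    # stand-in for a pretrained weight matrix
A, B = lora_init(d_out, d_in, rank, rng)

W_start = lora_weight(W0, A, B, s)     # equals W0 because B is zero

# After training, B is non-zero; the update s * B @ A has rank at most `rank`.
B_trained = rng.normal(size=(d_out, rank))
delta = lora_weight(W0, A, B_trained, s) - W0
```

Because `B` starts at zero, the adapted model initially matches the frozen model, while `A` must be non-zero for gradients to reach `B`. That non-zero initialization is the one the abstract identifies as compromising self-stability.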
Peer Reviews
Decision: ICLR 2026 Poster
Please see summary
* The paper introduces the $\gamma$-function as a novel analytical construct for understanding LoRA’s stability behavior. * Theoretical analysis yields an interpretable condition, $A=B=0$, under which stable feature learning can be achieved. * The proposed progressive-shrink mechanism is a practical way to approximate the ideal condition $A=B=0$ and to address the constraint that, in practice, $A$ and $B$ cannot both be set to 0.
* Definition 1 lacks rigorous justification. While it aligns with prior empirical observations, it represents only one possible stability condition. The framework built upon it may have limited generalizability, which can only be supported by broader empirical studies. * The definition of the $\gamma$-function appears mathematically infeasible. Although it seems inspired by logarithmic properties (e.g., $\log(x) + \log(y) = \log(x \times y)$ and $\log(x + y)$ is dominated by $\max(\log(x), \log
* This paper is solid in its theoretical contribution. The motivation and concepts are well illustrated, making the work easy to follow. The algorithm design is simple yet elegant, and the stability stopping criterion is theoretically justified. * Empirical results demonstrate both the effectiveness and stability of the proposed method. Moreover, it is computationally efficient, introducing only a minor additional runtime overhead. The approach is also compatible with existing LoRA setups witho
* The experimental settings and details are somewhat limited and unclear. First, how are the experiments on the QA datasets conducted? Is the model fine-tuned on a mixed training dataset and then evaluated on several benchmarks, or is it fine-tuned on one QA dataset and tested accordingly? If it is the latter case, I would suggest conducting additional experiments on general language understanding and dialogue datasets such as WizardLM to better assess the model’s generalization ability. Moreove
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Sparse and Compressive Sensing Techniques
