Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning
Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma

TL;DR
This paper introduces LoRA-SB, a low-rank fine-tuning method that approximates full fine-tuning performance with significantly fewer parameters by using a novel initialization strategy and architecture design.
Contribution
We propose LoRA-SB, a new low-rank fine-tuning approach that effectively simulates full fine-tuning, outperforming existing methods in efficiency and accuracy.
Findings
Outperforms LoRA and baselines across multiple tasks
Uses 27-90 times fewer learnable parameters
Achieves near full fine-tuning performance
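To give a feel for where a reduction of this order can come from, here is a rough count for a single weight matrix. It assumes standard LoRA trains both an m x r and an r x n factor, while a LoRA-XS-style adapter trains only the r x r matrix. The layer size (4096 x 4096) and the two ranks are hypothetical values chosen for illustration, not the paper's configurations, so the ratio is not a reproduction of the reported 27-90x figure.

```python
def lora_trainable_params(m: int, n: int, r: int) -> int:
    """Standard LoRA on an m x n weight: trains B (m x r) and A (r x n)."""
    return r * (m + n)


def lora_xs_trainable_params(r: int) -> int:
    """LoRA-XS-style adapter: only the r x r matrix between B and A is trained."""
    return r * r


# Hypothetical layer size and ranks, for illustration only.
m = n = 4096
lora_count = lora_trainable_params(m, n, r=32)   # 32 * 8192
xs_count = lora_xs_trainable_params(r=64)        # 64 * 64
ratio = lora_count / xs_count

print(lora_count, xs_count, ratio)
```

Because the r x r count does not depend on the layer dimensions, the gap widens as layers get larger, even when the low-rank method is given a higher rank.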
Abstract
Low-rank adapters have become standard for efficiently fine-tuning large language models, but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS, which inserts a learnable r x r matrix between B and A while keeping other matrices fixed, provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for scaling factor tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical…
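The initialization described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes the weight update is parameterized as B R A with B and A frozen and only R trained, and that the three factors come from a truncated SVD of an estimate of the first full fine-tuning update. The matrix `G` is random stand-in data, and the function names and any scaling of R are assumptions of this sketch.

```python
import numpy as np


def truncated_svd_init(update: np.ndarray, rank: int):
    """Return (B, R, A) such that B @ R @ A is the best rank-`rank` approximation of `update`."""
    U, S, Vt = np.linalg.svd(update, full_matrices=False)
    B = U[:, :rank]        # m x r, orthonormal columns, kept frozen
    A = Vt[:rank, :]       # r x n, orthonormal rows, kept frozen
    R = np.diag(S[:rank])  # r x r, the only trainable matrix
    return B, R, A


def grad_wrt_R(B: np.ndarray, A: np.ndarray, grad_W: np.ndarray) -> np.ndarray:
    """Chain rule for W = W0 + B @ R @ A: dL/dR = B^T (dL/dW) A^T."""
    return B.T @ grad_W @ A.T


rng = np.random.default_rng(0)
m, n, r = 64, 48, 4
G = rng.standard_normal((m, n))  # stand-in for the first full fine-tuning update

B, R, A = truncated_svd_init(G, r)
approx_error = np.linalg.norm(G - B @ R @ A)

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
singular_values = np.linalg.svd(G, compute_uv=False)
tail_norm = np.sqrt(np.sum(singular_values[r:] ** 2))
print(approx_error, tail_norm)
```

With orthonormal B and A, projecting a full gradient onto the r x r space is just `B.T @ grad_W @ A.T`, which is the property the abstract relies on when it says the constrained update space removes the need for scaling factor tuning.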
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well written, and the formulas are pretty and clear.
- Extensive experiments over 16 datasets and 4 models verified the effectiveness of LoRA-SB.
- My primary concern relates to the novelty of LoRA-SB. The overall framework of LoRA-SB builds upon LoRA-XS; however, LoRA-XS is not a general efficient LoRA framework, which may limit the generalizability of the proposed method. Additionally, the gradient-based initialization approach and the analysis of equivalent gradients bear strong similarities to LoRA-GA and LoRA-Pro, which further diminishes the originality of the contribution.
- The analysis in lines 266-279 appears somewhat redundant,
- LoRA-SB achieves high performance while maintaining a severely constrained parameter count, using 27x to 90x fewer trainable parameters compared to LoRA and its variants across various benchmarks.
- The method is grounded in theoretical proofs, demonstrating that the SVD-based initialization provides the optimal low-rank approximation of the initial gradient.
- The derived optimal gradient updates, combined with the orthonormal fixed bases B and A, inherently lead to scaling factor independence.
- Ablation studies show that the initialization is sensitive to noise levels when performing truncated SVD on the averaged update. This suggests the quality of the initial gradient estimation is critical and any practical implementation must ensure robust gradient computation.
- Although the initialization is optimal for the first step, the fixed nature of A and B inherently restricts the expressivity of the low-rank updates. While the paper claims updates are preserved throughout training, the
1. This paper provides a clear theoretical analysis of why LoRA-XS suffers from limited update capacity and scaling sensitivity, offering strong motivation for the proposed approach.
2. Proposes a new initialization strategy that aligns the low-rank update space with the first full fine-tuning gradient, leading to better optimization stability, scaling independence, and improved convergence.
3. Extensive experiments across diverse models and tasks show that LoRA-SB delivers stable and significant
1. The paper does not evaluate LoRA-SB against dynamic or adaptive rank approaches (e.g., AdaLoRA, Sparse LoRA). Because LoRA-SB uses a static, fixed-rank initialization, it is inherently incompatible with these flexible rank-allocation strategies. This omission leaves an important gap in the analysis, making it unclear how LoRA-SB compares to adaptive methods under similar parameter budgets.
2. LoRA-SB requires estimating the first full fine-tuning gradient on a subset of data, and the paper set
1. The limitations of LoRA-XS are clearly illustrated, providing sufficient motivation for LoRA-SB.
2. Theoretical analysis guarantees the design and correctness of LoRA-SB.
3. The empirical results demonstrate satisfactory performance compared to other initialization approaches.
1. The theoretical contributions in the paper lack novelty and depth. Lemma 1 follows directly from the definition of $\Delta W$, and Lemma 2 is an immediate consequence of the chain rule. These could be better presented as plain text rather than formal lemmas. Furthermore, Theorems 3 and 4 are straightforward extensions and simplifications of results from LoRA-Pro, while Theorem 6 is directly adapted from LoRA-GA. A more thorough discussion and comparison with existing theoretical works are nee
Taxonomy
Topics: Radio Frequency Integrated Circuit Design · Advancements in PLL and VCO Technologies · Microwave Engineering and Waveguides
