Norm-Bounded Low-Rank Adaptation
Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester

TL;DR
This paper introduces NB-LoRA, a novel low-rank adaptation method with explicit norm bounds that improves the robustness and performance of parameter-efficient fine-tuning on language and vision tasks.
Contribution
It proposes a new norm-bounded low-rank adaptation parameterization that guarantees norm constraints and improves robustness over existing methods.
Findings
Matches or surpasses performance of existing LoRA methods in language tasks.
Enhances robustness to hyper-parameters like rank, learning rate, and epochs.
Reduces catastrophic forgetting in vision fine-tuning.
Abstract
In this work, we propose norm-bounded low-rank adaptation (NB-LoRA) for parameter-efficient fine-tuning. NB-LoRA is a novel parameterization of low-rank weight adaptations that admits explicit bounds on each singular value of the adaptation matrix, which can thereby satisfy any prescribed unitarily invariant norm bound, including the Schatten norms (e.g., nuclear, Frobenius, spectral norm). The proposed parameterization is unconstrained, smooth, and complete, i.e., it covers all matrices satisfying the prescribed rank and singular-value bounds. Natural language generation experiments show that NB-LoRA matches or surpasses the performance of competing LoRA methods, while exhibiting stronger hyper-parameter robustness. Vision fine-tuning experiments show that NB-LoRA can avoid catastrophic forgetting with only a minor cost to adaptation performance, and compared to existing approaches it is…
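To make the link between per-singular-value bounds and Schatten-norm bounds concrete, here is a minimal NumPy sketch. It is not the paper's parameterization: it is a simplified stand-in that builds orthonormal factors with a Cayley transform of a square free matrix and squashes the singular values with a sigmoid. Unlike NB-LoRA as described in the abstract, this stand-in is neither complete (the sigmoid never reaches 0 or the bound exactly) nor parameter-efficient. The function names `cayley_orthonormal`, `bounded_low_rank`, and `schatten_norm`, and the example sizes and bounds, are assumptions made for illustration.

```python
import numpy as np

def cayley_orthonormal(free, r):
    """Map an unconstrained square matrix to r orthonormal columns (illustrative)."""
    n = free.shape[0]
    skew = free - free.T                         # skew-symmetric, so I + skew is invertible
    eye = np.eye(n)
    q = np.linalg.solve(eye + skew, eye - skew)  # Cayley transform: q is orthogonal
    return q[:, :r]

def bounded_low_rank(free_u, free_v, free_s, bounds):
    """Build a rank-<=r matrix whose i-th singular value is at most bounds[i].

    Assumes `bounds` is sorted in descending order.
    """
    r = bounds.shape[0]
    u = cayley_orthonormal(free_u, r)            # shape (m, r)
    v = cayley_orthonormal(free_v, r)            # shape (n, r)
    s = bounds / (1.0 + np.exp(-free_s))         # each s[i] lies in (0, bounds[i])
    return (u * s) @ v.T

def schatten_norm(w, p):
    """Schatten p-norm: the vector p-norm of the singular values."""
    sv = np.linalg.svd(w, compute_uv=False)
    if np.isinf(p):
        return sv.max()                          # spectral norm
    return (sv ** p).sum() ** (1.0 / p)          # p=1 nuclear, p=2 Frobenius

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
bounds = np.array([1.0, 0.5])                    # hypothetical per-singular-value bounds
delta_w = bounded_low_rank(
    rng.normal(size=(m, m)), rng.normal(size=(n, n)), rng.normal(size=r), bounds
)
```

Because every singular value is capped, each Schatten norm of `delta_w` is bounded by the corresponding vector norm of `bounds`: nuclear by `bounds.sum()`, Frobenius by `np.linalg.norm(bounds)`, and spectral by `bounds.max()`.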
Peer Reviews
Decision: Submitted to ICLR 2026
1. **Clear motivation.** The paper targets a well-documented limitation of standard LoRA (small initial gradients and sensitivity to learning rate) and proposes a theoretically grounded reparameterization that tightly controls singular values and associated norms. The mapping is smooth and complete, covering all matrices within the prescribed rank and singular-value bounds.
2. **Thorough empirical study.** The evaluation spans both LLMs and ViTs, includes comparisons against multiple baselines, abl
1. **Additional hyperparameters.** NB-LoRA introduces additional controls (e.g., the norm bound $\delta$). Although the paper argues for improved robustness, this still expands the tuning surface in practice, and tuning $\delta$ itself involves a trade-off between performance and robustness.
2. **Extra training time / GPU overhead.** The Cayley reparameterization (and its backward pass) adds measurable overhead relative to vanilla LoRA, although the paper reports it as minor.
- Turning norm-constrained low-rank adaptation into a smooth, complete reparameterization is novel and comes with clear theoretical underpinnings.
- The paper presents experiments on both language and vision models suggesting both performance gains and reduced forgetting, accompanied by useful analyses of training dynamics.
- Gradient-dynamics mismatch in analyses/plots: The method trains *free* parameters $\tilde A,\tilde B$ that map to $A,B$ through a Cayley transformation. Under a given learning rate, the *effective* update on $A,B$ is not simply $-\eta\nabla_{A,B}L$; it is mediated by the Jacobian of the reparameterization. Therefore, comparing raw gradient norms (or update magnitudes) across methods in different parameterizations can be misleading. A fair comparison should report the induced per-step update on
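The reviewer's point about Jacobian-mediated updates can be illustrated with a small NumPy sketch. It does not use the paper's Cayley-based map. As a labeled stand-in, it uses an elementwise reparameterization `w = bound * tanh(theta)` and a quadratic loss, both chosen only for illustration. A gradient step on the free parameters `theta` induces, to first order, the update $-\eta\, J J^\top \nabla_w L$ on `w` rather than $-\eta\, \nabla_w L$.

```python
import numpy as np

bound = 0.5  # hypothetical elementwise bound, for illustration only

def g(theta):
    """Stand-in reparameterization: free theta -> bounded w."""
    return bound * np.tanh(theta)

def g_jac_diag(theta):
    """Diagonal of the Jacobian dw/dtheta (the map is elementwise)."""
    return bound * (1.0 - np.tanh(theta) ** 2)

def loss_grad_w(w, target):
    """Gradient of the illustrative loss 0.5 * ||w - target||^2 w.r.t. w."""
    return w - target

rng = np.random.default_rng(0)
theta = rng.normal(size=8)
target = rng.normal(size=8)
lr = 1e-3

w = g(theta)
grad_w = loss_grad_w(w, target)
jac = g_jac_diag(theta)

# Gradient descent acts on the free parameters via the chain rule.
grad_theta = jac * grad_w
theta_new = theta - lr * grad_theta

induced_update = g(theta_new) - w        # what actually happens to w
first_order = -lr * jac ** 2 * grad_w    # -lr * J J^T grad_w for a diagonal J
naive_update = -lr * grad_w              # what a step taken directly on w would be

print(np.linalg.norm(induced_update), np.linalg.norm(naive_update))
```

In this toy setting the induced update is damped by the squared Jacobian entries, so comparing `naive_update` magnitudes across parameterizations would overstate the actual movement of `w`.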
**(S1)** The main insight (Thm. 4.2) is original and non-obvious. I find this an elegant solution.
**(S2)** The discussion of related work and positioning relative to it is excellent. The main paper discusses in detail the similarities and differences to DeLoRA and PiSSA, including detailed experiments on the attainable norm bounds (Fig. 4b) and which matrices can be learned (Fig. 2).
**(S3)** The experiments are on point; to me, the main evaluation of robust PEFT methods is not that they surpass the p
**(W1)** Experiments only consider the high-data regime (MetaMathQA = 395K samples, CodeFeedback = 66.5K samples). However, to test robustness, it would also be interesting to see whether the proposed method makes models more robust to overfitting in the low-data regime, for example through the DreamBooth task used in DeLoRA or similar.
**(W2)** Appendices E and F mention that experiments use learning-rate schedulers (cosine for LLMs, one-cycle for ViTs). This is unfortunate because it obscures the effe
Taxonomy
Topics: Image and Signal Denoising Methods · Neural Networks and Reservoir Computing · Advanced Image Processing Techniques
