LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, Jacek Tabor

TL;DR
LoRA-XS is a new fine-tuning method for large language models that drastically reduces parameter storage and adapts flexibly to various constraints while maintaining or improving accuracy.
Contribution
Introduces LoRA-XS, a theoretically backed, highly parameter-efficient fine-tuning approach that outperforms existing methods in storage and flexibility.
Findings
Over 100x reduction in storage compared to LoRA
Consistently matches or outperforms LoRA and VeRA in benchmarks
Flexible scaling from a single parameter to large modules
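The storage and scaling findings above follow from simple parameter counting. The sketch below compares the per-module trainable parameter count of standard LoRA (two factors of rank r) with that of LoRA-XS (one r x r matrix). The function names, the layer size of 4096, and the ranks are hypothetical and chosen only for illustration; they are not the configurations reported in the paper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one standard LoRA module:
    two low-rank factors of shape (d_out, r) and (r, d_in)."""
    return r * (d_in + d_out)


def lora_xs_params(r: int) -> int:
    """Trainable parameters in one LoRA-XS module: a single r x r matrix,
    independent of the layer dimensions."""
    return r * r


# Hypothetical layer size and ranks, for illustration only.
d = 4096
print(lora_params(d, d, r=8))   # 65536
print(lora_xs_params(r=16))     # 256
print(lora_xs_params(r=1))      # 1 (a single parameter per module)
```

Under these assumed sizes the LoRA-XS count does not change if `d` grows, whereas the LoRA count grows linearly with it.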
Abstract
The growth of large language models underscores the need for parameter-efficient fine-tuning. Despite its popularity, LoRA encounters storage and computational challenges when deploying multiple task- or user-specific modules. To address this, we introduce LoRA-XS, a novel fine-tuning method backed by a theoretical derivation. LoRA-XS drastically reduces trainable parameters by incorporating a small, trainable weight matrix between frozen low-rank matrices derived from the Singular Value Decomposition of pre-trained weights. This design enables LoRA-XS to reduce storage requirements by over 100x in 7B models compared to LoRA. Additionally, unlike other methods, LoRA-XS imposes no lower bound on trainable parameters - it can scale from a single parameter per module to arbitrarily large values, adapting to any storage or computational constraint. Evaluations on GLUE, GSM8K, MATH, and…
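The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, folding the singular values into the left factor is an assumption, and the zero initialization of the trainable matrix is chosen here only so that the adapted layer starts out identical to the pre-trained one.

```python
import numpy as np


def lora_xs_init(W: np.ndarray, r: int):
    """Build frozen factors from the top-r SVD of W plus a trainable r x r matrix.

    W has shape (d_out, d_in). Placing the singular values in A is an assumption.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # frozen, shape (d_out, r): top-r left vectors scaled by singular values
    B = Vt[:r, :]          # frozen, shape (r, d_in): top-r right singular vectors
    R = np.zeros((r, r))   # the only trainable parameters (zero init is an assumption)
    return A, B, R


def lora_xs_forward(x: np.ndarray, W: np.ndarray, A: np.ndarray,
                    B: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Apply the adapted layer x @ (W + A @ R @ B).T without forming the full update."""
    base = x @ W.T
    update = ((x @ B.T) @ R.T) @ A.T   # project to r dims, mix with R, project back
    return base + update


# Hypothetical small example.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
A, B, R = lora_xs_init(W, r=2)
x = rng.standard_normal((3, 4))
y = lora_xs_forward(x, W, A, B, R)
print(y.shape, R.size)   # (3, 6) 4
```

During fine-tuning only `R` would receive gradient updates, so the per-module storage is the r x r matrix alone, while `A` and `B` can be recomputed from the pre-trained weights.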
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- Experiments have been conducted on various domains, and sufficient ablation studies have been performed on the introduced modules.
- LoRA-XS achieves performance similar to LoRA with significantly fewer trainable parameters.
W1. Since the input is mapped to a subspace of $W$ and adjusted in scale within that space, the method may not adapt adequately if the distribution of the fine-tuning dataset differs significantly from that of the pre-training dataset.
W2. In Table 3, although the computational efficiency is promising because the learnable parameters are drastically reduced, the performance decrease is noticeable. For instance, for the MATH dataset with the Gemma model, the performance decreases from 31.28 whe
* LoRA-XS reduces the number of trainable parameters by over 100x in large-scale models without sacrificing performance, enabling the deployment of millions of personalized models with minimal memory overhead.
* LoRA-XS allows for precise control of the number of additional parameters and is independent of model dimensions, providing flexibility in memory usage and being more storage-friendly and adaptable.
* LoRA-XS outperforms LoRA and other recent methods like VeRA across various model sizes
* The discussion of related work, such as [1-2], can be further improved, given the rapid progress on LoRA-based parameter-efficient fine-tuning of LLMs.
* The representation ability of LoRA-XS seems weaker than that of LoRA, since the space spanned by LoRA-XS has a much lower dimension than that of LoRA. Could this make LoRA-XS easier to overfit than LoRA? Besides, is it possible for LoRA-XS to be harder to train on more difficult tasks than LoRA?
* Could the authors showcase the
1. The authors propose LoRA-XS, which uses an extremely small number of trainable parameters for PEFT scenarios.
1. The proposed method aims to learn the tra
2. The results of full-parameter fine-tuning for Mixtral and Gemma are taken directly from other sources, and their performance is lower than that of LoRA fine-tuning, which is somewhat surprising.
3. The LoRA fine-tuning results for the LLaMA series are also taken directly from other sources.
4. An LLM with LoRA-XS requires more space to store U, V, and Σr, and requires one more matrix multiplication than LoRA.
Unlike AdaLoRA, which uses SVD to dynamically adjust the rank of each adaptation matrix, LoRA-XS simply uses the top-$r$ singular vectors from the SVD of the pre-trained weights to construct learnable $r \times r$ matrices. The approach is simple and straightforward, yet it makes the following contributions.
1. Improved parameter efficiency while maintaining model performance. Unlike standard LoRA, where trainable parameters scale with the model's hidden dimension, LoRA-XS’s adaptation matrix
1. Computational cost of the SVD for each pre-trained weight matrix. Since LoRA-XS freezes the orthogonal matrices of the pre-trained $W$, the SVD must be computed in full rather than approximated via regularization techniques as in AdaLoRA. The computational load of computing the SVD for each $W$ would be very heavy, as the complexity of SVD is $O(\min(d_1, d_2)\, d_1 d_2)$. My concern is that the initial cost from computing SVD might offset the parameter efficiency gains during the adapt
**Originality**: The originality of the paper lies in its novel approach to parameter-efficient fine-tuning for LLMs. LoRA-XS builds upon the foundation of LoRA by incorporating a small, trainable matrix inserted between frozen low-rank matrices derived from Singular Value Decomposition. This innovative strategy introduces an extreme reduction in trainable parameters compared to prior methods like LoRA and VeRA. By making the number of parameters independent of model dimensions, the authors have
There are some weaknesses that impact the overall significance of the work: 1. **Limited Applicability of Memory Savings**: The authors argue that the 100x memory saving benefits the inference of millions of adapters. However, even a highly popular model such as LLaMA3-8B has only 494 adapters available on Hugging Face ([source](https://huggingface.co/models?other=base_model:adapter:meta-llama/Meta-Llama-3-8B)). This limited usage weakens the practical significance of the proposed memory savi
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Enhancement Techniques · Advanced Vision and Imaging
Methods: Balanced Selection
