The Expressive Power of Low-Rank Adaptation
Yuchen Zeng, Kangwook Lee

TL;DR
This paper provides a theoretical analysis of Low-Rank Adaptation (LoRA), demonstrating its expressive power in fine-tuning neural networks and establishing rank thresholds for accurate model adaptation.
Contribution
It offers the first theoretical insights into LoRA's expressive capabilities, including rank conditions for approximating target models in neural networks and Transformers.
Findings
LoRA can adapt any fully connected network to accurately represent a smaller target model once the LoRA rank reaches a threshold.
The paper quantifies the approximation error when the LoRA rank is below that threshold.
Any Transformer model can be adapted to a target model of the same size with a LoRA rank proportional to the embedding size.
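The rank-versus-error trade-off in these findings can be illustrated in the simplest possible setting, a single linear layer. This is only a sketch under that assumption and not the paper's multi-layer construction. For one weight matrix, the best rank-r update in spectral norm is the truncated SVD of the difference between the target and frozen weights (Eckart-Young theorem), and the remaining error equals the (r+1)-th singular value of that difference. The names `best_lora_update`, `spectral_error`, `W_frozen`, and `W_target` are hypothetical.

```python
import numpy as np

def best_lora_update(W_frozen, W_target, rank):
    """Return (B, A) so that B @ A is the best rank-`rank` approximation
    of W_target - W_frozen in spectral norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W_target - W_frozen, full_matrices=False)
    B = U[:, :rank] * s[:rank]   # shape (d_out, rank), columns scaled by singular values
    A = Vt[:rank, :]             # shape (rank, d_in)
    return B, A

def spectral_error(W_frozen, W_target, rank):
    """Spectral-norm gap between the target and the adapted weight."""
    B, A = best_lora_update(W_frozen, W_target, rank)
    return np.linalg.norm(W_target - (W_frozen + B @ A), ord=2)

# Hypothetical toy setup: random frozen and target weights of width 8.
rng = np.random.default_rng(0)
d = 8
W_frozen = rng.standard_normal((d, d))
W_target = rng.standard_normal((d, d))

singular_values = np.linalg.svd(W_target - W_frozen, compute_uv=False)
errors = [spectral_error(W_frozen, W_target, r) for r in range(d + 1)]

for r, err in enumerate(errors):
    print(f"rank {r}: spectral error {err:.4f}")
```

In this toy case the error shrinks as the rank grows and vanishes (up to floating-point precision) once the rank reaches the full width, which mirrors, for one layer only, the idea of a rank threshold above which the target is matched.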
Abstract
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model to accurately represent any smaller target model if the LoRA-rank is at least a certain threshold. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with…
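For readers unfamiliar with the "low-rank adaptation of weight matrices" mentioned in the abstract, the following is a minimal sketch of a single LoRA-adapted linear layer. It assumes the common convention of a frozen weight `W0` plus a trainable product `B @ A`, with `B` initialized to zero so the adapted layer starts out identical to the pretrained one. The class name `LoRALinear` and the sizes used are hypothetical, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Linear map with a frozen weight W0 and a trainable low-rank update B @ A."""

    def __init__(self, W0, rank, rng):
        d_out, d_in = W0.shape
        self.W0 = W0                                        # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init

    def effective_weight(self):
        # The update B @ A has rank at most `rank`.
        return self.W0 + self.B @ self.A

    def __call__(self, x):
        # x has shape (batch, d_in); the low-rank path avoids forming B @ A.
        return x @ self.W0.T + (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

# Hypothetical sizes chosen for illustration.
rng = np.random.default_rng(0)
W0 = rng.standard_normal((64, 64))
layer = LoRALinear(W0, rank=4, rng=rng)
x = rng.standard_normal((2, 64))
y = layer(x)

print("trainable parameters:", layer.num_trainable(), "vs full:", W0.size)
```

With these sizes the adapter trains 512 parameters instead of the 4096 in the full matrix, which is why the question of how large the rank must be for a given target model matters.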
Peer Reviews
Decision: ICLR 2024 poster
1. The authors claim to be the first to study the expressive power of Low-Rank Adaptation (LoRA) for different model architectures. If this is true (I do not have sufficient knowledge to check), the novelty of this paper is significant. 2. Their theoretical results align well with the recent advances of LoRA on LLMs. 3. Not only FNNs but also TFNs are explored, with both theoretical and empirical study.
(1) From Figure 1, I can see that LoRA on FNNs performs on par with gradient update, whereas LoRA on TFNs significantly outperforms gradient update. Could the authors explain this performance difference? (2) It is impressive that LoRA with rank=1 can match the performance of gradient update in Figure 3. Does this mean the gradient update does not actually learn well?
- The study conducts a thorough analysis of the expressive capabilities of LoRA, underpinned by a set of well-founded assumptions.
- The findings from this research offer a theoretical foundation for applying LoRA to a diverse range of models, including Transformers and Diffusion models, and furnish insights on how to select hyper-parameters for designing LoRA effectively.
- The insights provided by this work can streamline the design process for LoRA, especially when the depth and width of th
The experimental approach raises significant concerns. Given the widespread application of LoRA to various large language models (LLMs), such as LLaMA, there's an opportunity for the authors to substantiate their findings using models tasked with different challenges. Considering the availability of various model sizes in LLaMA and the comprehensive range of results provided by the original LoRA study, a comparison between the proposed theoretical analysis and empirical observations of LoRA woul
This is a theoretically strong paper, studying a very timely topic. While empirically, LoRA has been shown to do surprisingly well, a theoretical explanation for why has been missing. This paper is a good starting point in understanding how/why/when LoRA works.
While it is okay not to have them in this paper, I think it would be interesting to study other effects of LoRA theoretically. For example, how does LoRA affect generalization? What can we say about how fast LoRA converges, even if the target model can eventually be found by LoRA exactly?
Taxonomy
Topics: Machine Learning and ELM · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization
