MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang

TL;DR
This paper introduces MoRA, a high-rank updating method for large language models that improves upon LoRA by enabling more effective learning and memorization, especially in memory-intensive tasks.
Contribution
MoRA employs a square matrix with non-parameter operators for high-rank updates, maintaining parameter efficiency while enhancing learning capacity.
Findings
MoRA outperforms LoRA on memory-intensive tasks.
MoRA achieves comparable performance on other tasks.
The method allows weight merging for deployment like LoRA.
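The contribution and findings above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the compression and decompression operators here (summing input coordinates that share an index modulo the square size, and copying outputs back out by the same rule) are assumptions chosen for simplicity, and the names `square_size`, `compress_matrix`, and `decompress_matrix` are hypothetical. The sketch only shows that a square trainable matrix wrapped in fixed, parameter-free operators can match LoRA's parameter budget, reach a higher update rank, and still be merged into the base weight.

```python
import math
import numpy as np


def square_size(d: int, k: int, r: int) -> int:
    """Side of the largest square matrix that fits in LoRA's budget of r * (d + k) parameters."""
    return math.isqrt(r * (d + k))


def compress_matrix(k: int, r_hat: int) -> np.ndarray:
    """Fixed (r_hat x k) 0/1 operator: input coordinate j is added into slot j % r_hat.

    Assumed operator for illustration; it has no trainable parameters.
    """
    C = np.zeros((r_hat, k))
    C[np.arange(k) % r_hat, np.arange(k)] = 1.0
    return C


def decompress_matrix(d: int, r_hat: int) -> np.ndarray:
    """Fixed (d x r_hat) 0/1 operator: output coordinate i copies slot i % r_hat.

    Assumed operator for illustration; it has no trainable parameters.
    """
    D = np.zeros((d, r_hat))
    D[np.arange(d), np.arange(d) % r_hat] = 1.0
    return D


d = k = 64          # toy layer size (output dim d, input dim k)
r = 2               # LoRA rank used to set the parameter budget
r_hat = square_size(d, k, r)

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))          # frozen base weight
M = rng.standard_normal((r_hat, r_hat))   # the only trainable matrix in this sketch
C = compress_matrix(k, r_hat)
D = decompress_matrix(d, r_hat)

lora_params = r * (d + k)
square_params = r_hat * r_hat

# Adapter path: compress the input, apply the square matrix, decompress the output.
x = rng.standard_normal(k)
y_adapter = W0 @ x + D @ (M @ (C @ x))

# Because C and D are fixed linear maps, the update folds into a single weight.
delta_W = D @ M @ C
y_merged = (W0 + delta_W) @ x

update_rank = np.linalg.matrix_rank(delta_W)
print(lora_params, square_params, update_rank)
```

With these toy sizes the square matrix uses no more parameters than rank-2 LoRA, while the merged update has rank `r_hat` rather than 2. Any other fixed linear compression and decompression pair would merge in the same way.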
Abstract
Low-rank adaptation is a popular parameter-efficient fine-tuning method for large language models. In this paper, we analyze the impact of low-rank updating, as implemented in LoRA. Our findings suggest that the low-rank updating mechanism may limit the ability of LLMs to effectively learn and memorize new knowledge. Inspired by this observation, we propose a new method called MoRA, which employs a square matrix to achieve high-rank updating while maintaining the same number of trainable parameters. To achieve this, we introduce corresponding non-parameter operators that reduce the input dimension and increase the output dimension for the square matrix. Furthermore, these operators ensure that the weight can be merged back into LLMs, so our method can be deployed like LoRA. We perform a comprehensive evaluation of our method across five tasks: instruction tuning, mathematical…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The authors motivate their approach by a clever memorization task. They show that vanilla LoRA with low rank (e.g. rank 8) does poorly on a contrived memorization task. They train on 10k unique IDs for 100 epochs where the LLM has to memorize the value associated with a key. This is a nice way to motivate the rest of the results. The authors do rigorous benchmarking across training approaches, from instruction finetuning (IFT) and continual pretraining to pretraining from scratch. This is not s
1. The authors note that “increasing rank alleviates this problem” of memorizing UUID pairs. Although MoRA converges faster than LoRA at rank 256, both eventually converge. What are the benefits of using MoRA rank 8 vs. LoRA rank 256? According to Table 2, they get roughly the same performance. If MoRA is a strict Pareto improvement over LoRA for the same number of parameters, the paper should state so unambiguously (the paper states this in a few separate places for specific tasks). If ther
1. The paper provides a detailed analysis of the limitations of LoRA in tasks that require memorizing new knowledge. 2. The proposed high-rank updating method shows superior performance in memorization tasks compared to standard LoRA.
1. In Sections 5.1 and 5.3, MoRA is only compared with standard LoRA; more baselines (e.g., [1][2]) are needed to substantiate your hypothesis that high-rank updating benefits memory-intensive tasks. 2. It appears that the experiments in Section 5.2 are aimed at demonstrating MoRA's superiority in memory-intensive tasks and its comparable performance on other tasks. However, results indicate that MoRA underperforms other baselines in mathematical reasoning, which I consider memory-intensive, as in
1. This method, MoRA, is well-motivated by identifying weaknesses in LoRA and other low-rank adaptation methods. An interesting and intuitive experiment on memorizing UUID pairs further highlights the necessity of developing methods to address this limitation. 2. Experimental results on Continual Pretraining demonstrate MoRA's effectiveness in this scenario. 3. The provided code is also well-written.
1. The primary advantage of MoRA lies in tasks requiring significant enhancement of LLM knowledge and capabilities, such as memorizing UUID pairs and continual pretraining (CP). However, it may be more practical to directly use full finetuning in such cases, particularly for CP, which generally demands high computational resources to achieve optimal performance. This raises questions about the suitability of applying PEFT in such high-cost scenarios. As shown in Table 2, MoRA cannot outperform f
The authors observe that LoRA’s low-rank updates (e.g., using a rank as low as 8) may not fully leverage the model's learning potential. To address this, they propose restructuring LoRA updates by compressing the input dimension, increasing the rank of the update matrix, and then decompressing the output to match the original dimensions. This approach maintains the same number of trainable parameters while increasing the rank, which is an interesting idea. However, while this method enhances lea
1. Unclear Compression and Decompression Operations: The compression and decompression operations described in equations (8) and following are not clearly explained. The rationale behind choosing these specific operations is unclear and requires further elaboration. 2. Rotation Operation: While the rotation operation is likely intended to preserve positional information after compressing the input, the explanation provided in the paper is not sufficiently clear. The authors need to clarify why
Taxonomy
Topics: Blind Source Separation Techniques · Advanced Data Compression Techniques · Advanced SAR Imaging Techniques
