MoRE: A Mixture of Low-Rank Experts for Adaptive Multi-Task Learning
Dacao Zhang, Kun Zhang, Shimao Chu, Le Wu, Xin Li, Si Wei

TL;DR
This paper introduces MoRE, a mixture of low-rank experts, to improve multi-task fine-tuning of large language models by adaptively selecting low-rank modules, enhancing performance without extra inference cost.
Contribution
The paper proposes a novel MoRE framework that aligns different low-rank experts with tasks and uses an adaptive rank selector, improving multi-task PEFT efficiency and effectiveness.
Findings
MoRE outperforms traditional LoRA in multi-task benchmarks.
MoRE achieves significant performance gains without additional inference cost.
The adaptive rank selector effectively chooses the appropriate expert for each task.
Abstract
With the rapid development of Large Language Models (LLMs), Parameter-Efficient Fine-Tuning (PEFT) methods, which aim to fine-tune LLMs efficiently with fewer parameters, have gained significant attention. As a representative PEFT method, Low-Rank Adaptation (LoRA) introduces low-rank matrices to approximate the incremental tuning parameters and achieves impressive performance across multiple scenarios. Many improvements to LoRA have since been proposed. However, these methods either focus on single-task scenarios or train multiple LoRA modules separately for multi-task scenarios, which limits the efficiency and effectiveness of LoRA in multi-task settings. To better adapt to multi-task fine-tuning, we propose a novel Mixture of Low-Rank Experts (MoRE) for multi-task PEFT. Specifically, instead of using an individual LoRA for each task, we…
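For reference, the standard LoRA parameterization that the abstract builds on can be written as below. The symbols (W_0, A, B, d, k, r) follow common LoRA notation and are an assumption here; they are not taken from this page.

```latex
h = W_0 x + \Delta W x = W_0 x + B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only A and B are trained while W_0 stays frozen, which is where the parameter savings come from.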
Peer Reviews
Decision: Submitted to ICLR 2025
1. The idea is interesting. Rather than allocating dedicated experts through separate low-rank adapters, this work reuses a single low-rank matrix, employing a gating function to select portions of it as distinct experts. This approach effectively models diverse tasks in multi-task training.
2. The experimental results are promising. The proposed method consistently outperforms baselines across most datasets, with particularly strong performance in the LLAMA setting.
1. The contribution over existing mixture-of-LoRA methods (such as MixLoRA and MOELoRA) appears limited. This work can be interpreted as a specific case within existing frameworks, where each expert has a rank of 1, and a condition is enforced such that, when selecting expert k, all preceding experts (1 to k-1) are also selected.
2. While the proposed method reuses the adapter matrix and claims that this reduces the number of parameters, making it independent of the number of experts, several c
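The reviewer's reading (each expert is rank 1, and selecting expert k also selects experts 1 to k-1) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: the names `nested_expert_delta` and `rank_for_task`, the matrix shapes, and the values are all hypothetical.

```python
# Illustrative sketch only: all names and values here are hypothetical,
# not taken from the paper.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def nested_expert_delta(B, A, k):
    """Weight update built from the first k rank-1 components of (B, A).

    B has shape (d_out, r) and A has shape (r, d_in). Expert k reuses the
    components of experts 1..k-1, so the experts are nested slices of one
    shared pair of matrices rather than separate adapters.
    """
    B_k = [row[:k] for row in B]  # first k columns of B
    A_k = A[:k]                   # first k rows of A
    return matmul(B_k, A_k)

# Shared low-rank pair with maximum rank r = 3 (toy values).
B = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]             # d_out = 2, r = 3
A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]                  # r = 3, d_in = 2

# Hypothetical gate output: one selected rank per task.
rank_for_task = {"task_a": 1, "task_b": 3}

deltas = {task: nested_expert_delta(B, A, k)
          for task, k in rank_for_task.items()}

# The adapter's parameter count depends on r only, not on how many
# experts (ranks) are exposed to the gate.
num_adapter_params = len(B) * len(B[0]) + len(A) * len(A[0])
```

In this toy setup both tasks share the same 12 adapter parameters; only the slice width differs per task.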
(1) The topic of multi-task PEFT is important.
(2) Although the current version contains many typos, the paper is easy to follow.
(3) Using a gate function to select appropriate ranks for different tasks is interesting.
(4) The experiments are convincing.
(1) The authors should proofread their writing. For example, in Line 161: "The target is to learn a shared model F with parameters θ to satisfy the requirements of different (?) simultaneously." And in Line 243: "One step further, during the backward pass, the arg max in Eq.(3) is non-differentiable"; the argmax actually appears in Eq. (4).
(2) The authors claim that a smaller learning rate benefits the training of MoRE; it would be better to test different learning rates in the extensive experim
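The non-differentiability of argmax mentioned in the quoted Line 243 can be seen numerically. The sketch below compares a hard one-hot argmax with a softmax relaxation using finite differences. The softmax relaxation is shown only as one common workaround; this excerpt does not say which technique the paper uses, and all function names and scores here are hypothetical.

```python
import math

def argmax_one_hot(scores):
    """Hard selection: 1.0 at the highest score, 0.0 elsewhere."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def softmax(scores, temperature=1.0):
    """Smooth relaxation of argmax (numerically stabilized)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def finite_diff(fn, scores, index, eps=1e-4):
    """Sensitivity of fn(scores)[index] to a small bump in scores[index]."""
    bumped = list(scores)
    bumped[index] += eps
    return (fn(bumped)[index] - fn(scores)[index]) / eps

# Hypothetical gate scores over three candidate ranks.
scores = [0.2, 1.5, 0.7]

hard_grad = finite_diff(argmax_one_hot, scores, 1)  # piecewise constant
soft_grad = finite_diff(softmax, scores, 1)         # smooth, non-zero
```

The hard selector is flat almost everywhere, so no learning signal reaches the gate scores, whereas the relaxed selector yields a usable gradient.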
This paper is well written, and its statements are clear. The method improves on previous LoRA MoE methods and provides some gains. The experiments show promising improvements over the reported baselines.
1. This paper evaluates only on GLUE benchmarks, where most datasets focus on natural language processing rather than commonsense or complex reasoning. The reasoning and knowledge capabilities of the LLMs remain underexplored in this paper.
2. MoE-based LoRA uses more parameters than standard LoRA methods, and this LoRA matrix cannot be merged into the original LLMs during inference. Therefore, during inference, the model requires more memory to perform inference and could
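The merging concern raised above can be made concrete with a toy check of what merging means for a single, fixed low-rank update: folding B·A into the frozen weight gives the same output as running the adapter as a separate branch. Whether MoRE's task-dependent update can be merged this way depends on details not shown in this excerpt; the matrices and names below are hypothetical.

```python
# Toy check of LoRA merging for one fixed low-rank update.
# All values and names are hypothetical.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

W0 = [[1.0, 2.0],
      [3.0, 4.0]]        # frozen base weight, shape (2, 2)
B = [[1.0],
     [2.0]]              # shape (2, 1), rank r = 1
A = [[0.5, -1.0]]        # shape (1, 2)
x = [[2.0],
     [3.0]]              # input as a column vector

# Unmerged: base path plus a separate adapter path at inference time.
y_unmerged = add(matmul(W0, x), matmul(B, matmul(A, x)))

# Merged: fold the update into the weight once, then do a single matmul.
W_merged = add(W0, matmul(B, A))
y_merged = matmul(W_merged, x)
```

Once merged, inference needs one matrix multiply per layer and no extra adapter memory. If the update changes per input or per task, a separate merged weight per configuration would be needed, which is the trade-off the reviewer points to.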
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Mobile Crowdsensing and Crowdsourcing
Methods: Focus · ALIGN
