Higher Layers Need More LoRA Experts
Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, and V.S. Subrahmanian

TL;DR
This paper introduces MoLA, a parameter-efficient MoE method that allocates LoRA experts layer by layer in Transformer models. Assigning more experts to higher layers improves performance on NLP benchmarks.
Contribution
The paper proposes a layer-wise expert allocation strategy for combining LoRA with MoE, in which each Transformer layer can employ a different number of LoRA experts, for parameter-efficient tuning.
Findings
Allocating more LoRA experts to higher layers improves model performance.
MoLA outperforms baselines while using fewer parameters.
Layer-wise expert configuration is effective across NLP benchmarks.
Abstract
Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this statement also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures…
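To make the idea of layer-wise allocation concrete, here is a minimal sketch in Python. It is not the authors' implementation. The function names (allocate_experts, lora_expert_params), the 8-layer model, the group counts, and the dimensions are all hypothetical and chosen only for illustration. The sketch assumes layers are split into equal contiguous groups from lowest to highest, and that each LoRA expert consists of two low-rank matrices, A (rank x d_in) and B (d_out x rank). Router weights are not counted.

```python
from typing import List, Sequence


def allocate_experts(num_layers: int, group_counts: Sequence[int]) -> List[int]:
    """Expand per-group expert counts into one expert count per layer.

    Assumption: layers are split into len(group_counts) contiguous,
    equal-sized groups, ordered from the lowest layer to the highest.
    """
    num_groups = len(group_counts)
    if num_layers <= 0 or num_groups == 0 or num_layers % num_groups != 0:
        raise ValueError("num_layers must be a positive multiple of the number of groups")
    group_size = num_layers // num_groups
    # Repeat each group's count once for every layer in that group.
    return [count for count in group_counts for _ in range(group_size)]


def lora_expert_params(experts_per_layer: Sequence[int], d_in: int, d_out: int, rank: int) -> int:
    """Count trainable LoRA expert weights for one adapted matrix per layer.

    Each expert holds A (rank x d_in) and B (d_out x rank), so it has
    rank * (d_in + d_out) weights. Router weights are ignored here.
    """
    per_expert = rank * (d_in + d_out)
    return sum(experts_per_layer) * per_expert


# Hypothetical 8-layer model split into 4 groups; the counts are illustrative only.
higher_heavy = allocate_experts(8, [1, 2, 3, 4])  # more experts in higher layers
uniform = allocate_experts(8, [4, 4, 4, 4])       # same count in every layer

print(higher_heavy)                                                 # [1, 1, 2, 2, 3, 3, 4, 4]
print(lora_expert_params(higher_heavy, d_in=16, d_out=16, rank=2))  # 1280
print(lora_expert_params(uniform, d_in=16, d_out=16, rank=2))       # 2048
```

In this toy setting the higher-heavy allocation uses 20 experts in total against 32 for the uniform one, which shows how skewing experts toward higher layers can reduce the trainable parameter count. Whether such an allocation also performs better is an empirical question that the paper addresses.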
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
