MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts
Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye and, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal, Yang, Mingjie Tang

TL;DR
MixLoRA introduces a resource-efficient sparse MoE approach based on LoRA, significantly improving multi-task learning performance and reducing memory and computation costs in large language model fine-tuning.
Contribution
The paper proposes MixLoRA, a novel LoRA-based sparse MoE method that enhances multi-task learning performance while reducing resource requirements for LLM fine-tuning.
Findings
MixLoRA achieves about 9% accuracy improvement over state-of-the-art PEFT methods.
Reduces GPU memory consumption by 40%.
Lowers token computation latency by 30%.
Abstract
Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Expert (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24GB memory. To tackle these challenges, we propose MixLoRA, an approach to construct a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling
MethodsAttention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
