MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

Dengchun Li; Yingzi Ma; Naizheng Wang; Zhengmao Ye; Zhiyuan Cheng; Yinghao Tang; Yan Zhang; Lei Duan; Jie Zuo; Cal Yang; Mingjie Tang

arXiv:2404.15159 · cs.CL · July 23, 2024 · 2 cites

TL;DR

MixLoRA introduces a resource-efficient sparse MoE approach based on LoRA, significantly improving multi-task learning performance and reducing memory and computation costs in large language model fine-tuning.

Contribution

The paper proposes MixLoRA, a novel LoRA-based sparse MoE method that enhances multi-task learning performance while reducing resource requirements for LLM fine-tuning.

Findings

01. MixLoRA achieves about 9% accuracy improvement over state-of-the-art PEFT methods.

02. MixLoRA reduces GPU memory consumption by 40%.

03. MixLoRA lowers token computation latency by 30%.

Abstract

Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models to specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24 GB of memory. To tackle these challenges, we propose MixLoRA, an approach to construct a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other…
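
The routing idea described in the abstract (several LoRA experts attached to a frozen layer, with a top-k router choosing which ones fire for each token) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single frozen linear projection instead of a full feed-forward block, a softmax taken over the selected logits only, and the usual LoRA convention of a zero-initialized B matrix. The names LoRAExpertLayer, top_k_gates, matvec, and rand_matrix are hypothetical and chosen for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian matrix, standing in for learned or pre-trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def top_k_gates(logits, k):
    """Map the k largest logits to weights that are softmax-normalized over those k."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


class LoRAExpertLayer:
    """Hypothetical sketch: one frozen linear layer plus top-k routed LoRA experts."""

    def __init__(self, d_in, d_out, n_experts=4, rank=2, alpha=4.0, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.scale = alpha / rank  # standard LoRA scaling factor
        self.frozen = rand_matrix(d_out, d_in, rng)      # pre-trained weight, never updated
        self.router = rand_matrix(n_experts, d_in, rng)  # trainable routing weights
        # Each expert is a low-rank pair (A, B); B starts at zero so the
        # layer initially reproduces the frozen model's output.
        self.lora_a = [rand_matrix(rank, d_in, rng) for _ in range(n_experts)]
        self.lora_b = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_experts)]

    def forward(self, x):
        """Return (output, gates) for one token vector x."""
        out = matvec(self.frozen, x)
        gates = top_k_gates(matvec(self.router, x), self.top_k)
        for idx, gate in gates.items():
            # Low-rank update of the selected expert: B_idx @ (A_idx @ x)
            delta = matvec(self.lora_b[idx], matvec(self.lora_a[idx], x))
            out = [o + gate * self.scale * d for o, d in zip(out, delta)]
        return out, gates


if __name__ == "__main__":
    layer = LoRAExpertLayer(d_in=8, d_out=8)
    output, gates = layer.forward([0.5] * 8)
    print("selected experts:", sorted(gates))
```

Because the gate weights sum to one, adding the gated low-rank updates to the shared frozen output is equivalent, for a single linear layer, to taking a gated sum of per-expert outputs. Only the router and the LoRA matrices would be trained in such a setup; the frozen weight is shared by all experts, which is where the memory saving in this kind of design comes from.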

Code & Models

Models

TUDB-Labs/alpaca-mixlora-7b (Hugging Face model)

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections