ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang

TL;DR
ME-Switch introduces a memory-efficient framework for serving multiple large language model experts by combining salient-aware delta compression and domain-based routing, significantly reducing memory usage while maintaining high performance.
Contribution
The paper proposes an expert switching framework for large language models that reduces memory footprint through salient-aware quantization of delta weights and selects the appropriate expert for each request through domain classification.
Findings
Reduces model size by 1.74x for three Mistral-7B models.
Maintains nearly lossless performance on various tasks.
Enables serving 16 Mistral-7B models on a single GPU.
Abstract
LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts can pose significant memory challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests can incur substantial I/O costs. Previous approaches decompose the expert weights as the pre-trained weights plus delta weights, followed by quantizing the delta weights using output channel-wise step sizes to reduce the model size. However, these methods overlook the fact that certain input channels of delta weights can cause significant quantization errors at extremely low bitwidths. Additionally, existing methods assume that the appropriate model for a user request is known in advance, which is not the case in practice. To this end, we introduce ME-Switch,…
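The abstract is cut off before it describes the method, so the following is only a minimal sketch of the general idea it sets up, not the paper's algorithm. It decomposes an expert into pre-trained weights plus delta weights, quantizes the delta with one step size per output channel, and optionally leaves a few outlier input channels unquantized. The saliency score (largest absolute delta value in an input channel), the parameter `k_salient`, the toy weights, and all function names are assumptions made for illustration.

```python
# Illustrative sketch only; this is NOT the ME-Switch implementation.
# Assumptions: weights are [out_channels][in_channels] lists of floats,
# an input channel (column) is "salient" if it holds the largest absolute
# delta values, salient columns stay unquantized, and the remaining entries
# use symmetric uniform quantization with one step size per output channel.

def delta_weights(finetuned, pretrained):
    """Delta = fine-tuned expert weights minus pre-trained weights."""
    return [[f - p for f, p in zip(fr, pr)]
            for fr, pr in zip(finetuned, pretrained)]

def salient_columns(delta, k):
    """Indices of the k input channels with the largest absolute delta."""
    n_in = len(delta[0])
    score = [max(abs(row[j]) for row in delta) for j in range(n_in)]
    ranked = sorted(range(n_in), key=lambda j: score[j], reverse=True)
    return set(ranked[:k])

def quantize_row(row, skip, bits):
    """Quantize one output channel, leaving the columns in `skip` intact."""
    if bits < 2:
        raise ValueError("this sketch needs bits >= 2")
    qmax = 2 ** (bits - 1) - 1
    kept = [abs(v) for j, v in enumerate(row) if j not in skip]
    step = max(kept) / qmax if kept and max(kept) > 0 else 1.0
    out = []
    for j, v in enumerate(row):
        if j in skip:
            out.append(v)                      # salient entry kept as-is
        else:
            q = max(-qmax - 1, min(qmax, round(v / step)))
            out.append(q * step)               # dequantized value
    return out

def compress_delta(delta, k_salient, bits):
    skip = salient_columns(delta, k_salient)
    return [quantize_row(row, skip, bits) for row in delta]

def max_abs_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy weights: input channel 2 carries an outlier-sized delta.
pretrained = [[0.10, -0.20, 0.30, 0.05],
              [0.00, 0.40, -0.10, 0.20]]
finetuned = [[0.12, -0.19, 0.90, 0.04],
             [0.01, 0.38, -0.80, 0.21]]

delta = delta_weights(finetuned, pretrained)
uniform = compress_delta(delta, k_salient=0, bits=2)  # every channel quantized
mixed = compress_delta(delta, k_salient=1, bits=2)    # outlier channel kept

err_uniform = max_abs_error(delta, uniform)
err_mixed = max_abs_error(delta, mixed)
print("max error, all channels quantized:", err_uniform)
print("max error, salient channel kept:  ", err_mixed)
```

In this toy case the outlier channel inflates the per-row step size, so the small delta entries all collapse to zero at 2 bits. Excluding that channel from quantization lets the step size shrink to fit the remaining entries, which is the effect the abstract attributes to certain input channels at extremely low bitwidths.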
Taxonomy
Topics: Recommender Systems and Techniques · Topic Modeling
