QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
HamidReza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek El-Ghazawi

TL;DR
This paper introduces a system for efficiently serving multiple mixture-of-experts (MoE) large language models on a single GPU by sharing similar experts across models and dynamically swapping non-expert layers at runtime, reducing memory usage and improving throughput.
Contribution
It proposes similarity-based expert consolidation and runtime partial reconfiguration to enable scalable, high-quality multi-model serving on a single GPU.
Findings
Achieves an 85% reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG).
Maintains output quality across multiple model variants.
Demonstrates scalability with up to four model variants.
Abstract
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a…
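The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity metric, the 0.99 threshold, the greedy pooling order, and the names `consolidate_experts` and `PartialReconfigServer` are all assumptions made for illustration, since the excerpt does not specify them. Experts are represented as flattened weight vectors, and non-expert layers as opaque per-model objects.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consolidate_experts(models, threshold=0.99):
    """Greedily merge similar experts across models (assumed policy).

    models: dict mapping model name -> list of expert weight vectors.
    Returns (pool, mapping): pool is the list of experts kept in memory,
    and mapping[name][i] is the pool index that expert i of `name` uses.
    """
    pool, mapping = [], {}
    for name, experts in models.items():
        mapping[name] = []
        for weights in experts:
            for idx, shared in enumerate(pool):
                if cosine(weights, shared) >= threshold:
                    mapping[name].append(idx)  # reuse an existing expert
                    break
            else:
                pool.append(weights)  # no close match: keep a new copy
                mapping[name].append(len(pool) - 1)
    return pool, mapping


class PartialReconfigServer:
    """Keeps the shared expert pool resident; swaps only non-expert layers."""

    def __init__(self, pool, mapping, non_expert_layers):
        self.pool = pool
        self.mapping = mapping
        self.non_expert_layers = non_expert_layers  # model name -> layers
        self.active_model = None
        self.resident = None
        self.swaps = 0

    def serve(self, model_name):
        # Reconfigure only when the request targets a different model.
        if model_name != self.active_model:
            self.resident = self.non_expert_layers[model_name]
            self.active_model = model_name
            self.swaps += 1
        return self.resident, self.mapping[model_name]


# Toy example: two fine-tuned variants with two experts each.
models = {
    "A": [[1.0, 0.0], [0.0, 1.0]],
    "B": [[1.0, 0.001], [1.0, 1.0]],  # first expert is nearly identical to A's
}
pool, mapping = consolidate_experts(models)
server = PartialReconfigServer(pool, mapping, {"A": "attn_A", "B": "attn_B"})
for request in ["A", "A", "B"]:
    server.serve(request)
```

In this toy run, four experts collapse to three pooled experts, and the three requests trigger two swaps of non-expert layers, because consecutive requests for the same model reuse the resident layers.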
Taxonomy
Topics: Big Data and Digital Economy · Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Switch FFN · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
