QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

HamidReza Imani; Jiaxin Peng; Peiman Mohseni; Abdolah Amirany; Tarek El-Ghazawi

arXiv:2505.06481·cs.LG·May 13, 2025


TL;DR

This paper introduces a system for serving multiple fine-tuned mixture-of-experts (MoE) large language models on a single GPU. It shares similar experts across models to reduce memory usage and dynamically swaps non-expert layers at runtime to preserve output quality, while keeping throughput competitive.

Contribution

The paper proposes similarity-based expert consolidation and runtime partial reconfiguration to enable scalable multi-model serving on a single GPU while preserving output quality.

Findings

01. Achieves an 85% reduction in turnaround time compared with NVIDIA Multi-Instance GPU (MIG).

02. Maintains output quality across multiple model variants.

03. Scales to as many as four model variants.

Abstract

The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, which dynamically replaces non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a…
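The two ideas named in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the cosine-similarity criterion, the `threshold` value, the flat-list weight representation, and the names `consolidate_experts` and `Server` are all assumptions made for illustration. Experts from different models that occupy the same slot and are sufficiently similar are stored once in a shared pool, and the server swaps only the non-expert layers when a request targets a different model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def consolidate_experts(models, threshold=0.99):
    """Build a shared expert pool (hypothetical helper).

    models: {name: {"experts": [vector, ...], "non_expert": object}}
    Returns (pool, mapping), where mapping[name][slot] indexes into pool.
    An expert reuses a pooled entry only if it sits in the same slot and
    its similarity to that entry reaches the (assumed) threshold.
    """
    pool, slot_of, mapping = [], [], {}
    for name, model in models.items():
        mapping[name] = []
        for slot, weights in enumerate(model["experts"]):
            match = None
            for idx, stored in enumerate(pool):
                if slot_of[idx] == slot and cosine(stored, weights) >= threshold:
                    match = idx
                    break
            if match is None:  # no similar expert yet: keep a private copy
                pool.append(weights)
                slot_of.append(slot)
                match = len(pool) - 1
            mapping[name].append(match)
    return pool, mapping

class Server:
    """Toy server: experts stay resident, non-expert layers are swapped."""

    def __init__(self, models, threshold=0.99):
        self.pool, self.mapping = consolidate_experts(models, threshold)
        self.non_expert = {n: m["non_expert"] for n, m in models.items()}
        self.active = None  # model whose non-expert layers are loaded
        self.swaps = 0      # number of partial reconfigurations so far

    def handle(self, model_name):
        if self.active != model_name:
            # Stand-in for copying this model's non-expert layers to the GPU.
            self.active = model_name
            self.swaps += 1
        experts = [self.pool[i] for i in self.mapping[model_name]]
        return self.non_expert[model_name], experts

models = {
    "base": {"experts": [[1.0, 0.0], [0.0, 1.0]], "non_expert": "A"},
    "tuned": {"experts": [[1.0, 0.001], [1.0, 1.0]], "non_expert": "B"},
}
server = Server(models)
for request in ["base", "base", "tuned"]:
    server.handle(request)
```

In this toy setup the first expert of `tuned` is nearly identical to that of `base` and is shared, while the second differs and keeps its own copy, so three expert vectors are stored instead of four. Consecutive requests for the same model cause no swap, which suggests why grouping requests by model would reduce reconfiguration overhead.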


Videos

QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration (SlidesLive)

Taxonomy

Topics: Big Data and Digital Economy · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Switch FFN · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax