MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li

TL;DR
MoE-Prism introduces a co-design approach that transforms monolithic MoE experts into fine-grained, elastic components, enabling dynamic adaptation to diverse service requirements and significantly improving resource efficiency and performance.
Contribution
It presents a novel two-phase methodology for deconstructing monolithic experts into sub-experts and deploying QoS-aware scheduling for elastic MoE services, without retraining.
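The abstract describes grouping an expert's neurons into sub-experts. The sketch below illustrates one reason such a split can work without retraining: in a feed-forward expert with an element-wise activation, the output is a sum over hidden neurons, so any partition of those neurons yields sub-experts whose outputs add up to the original. This is a minimal illustration under assumptions, not the paper's method: the ReLU expert, the contiguous `split_neurons` partition, and every function name are hypothetical, and the paper's solver chooses groups with a metaheuristic instead.

```python
import random

def matvec(W, x):
    # Dense matrix-vector product on plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def expert_forward(W_in, W_out, x, neurons=None):
    """Toy FFN expert: W_out @ relu(W_in @ x), restricted to `neurons`.

    W_in is d_hidden x d_model, W_out is d_model x d_hidden.
    neurons=None means "use every hidden neuron" (the monolithic expert).
    """
    if neurons is None:
        neurons = list(range(len(W_in)))
    hidden = [max(0.0, h) for h in matvec([W_in[j] for j in neurons], x)]
    return [sum(W_out[i][j] * h for j, h in zip(neurons, hidden))
            for i in range(len(W_out))]

def split_neurons(d_hidden, n_sub):
    # Hypothetical contiguous partition; assumes d_hidden % n_sub == 0.
    size = d_hidden // n_sub
    return [list(range(g * size, (g + 1) * size)) for g in range(n_sub)]

random.seed(0)
d_model, d_hidden, n_sub = 4, 8, 4
W_in = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
W_out = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
x = [random.uniform(-1, 1) for _ in range(d_model)]

full = expert_forward(W_in, W_out, x)            # monolithic expert
groups = split_neurons(d_hidden, n_sub)          # sub-expert neuron groups
parts = [expert_forward(W_in, W_out, x, g) for g in groups]
recombined = [sum(p[i] for p in parts) for i in range(d_model)]

# Activating all sub-experts reproduces the monolithic output (up to rounding).
max_gap = max(abs(a - b) for a, b in zip(full, recombined))
```

Summing only some of the entries in `parts` gives an approximate output at proportionally lower compute, which is the kind of intermediate operating point that fine-grained sub-experts make available. How good that approximation is depends on how the neurons are grouped.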
Findings
Over 4 times more stable operating points.
Up to 19.9% higher throughput under latency constraints.
Up to 10.36% lower latency under resource constraints.
Abstract
Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a "quality cliff", offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained "sub-experts." This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons,…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · IoT and Edge/Fog Computing · Cloud Computing and Resource Management
