MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
Bowen Zhou, Jinrui Jia, Wenhao He, Yong Zhang, Fang Dong

TL;DR
MixServe is an automatic distributed serving system for MoE models that intelligently selects and optimizes hybrid parallelism strategies, significantly improving inference efficiency on large-scale language models.
Contribution
The paper introduces a novel TP-EP hybrid parallelism with fused communication, and an automatic strategy selection mechanism for efficient MoE model deployment.
Findings
Achieves 1.08~3.80x acceleration in time to first token
Improves throughput by up to 50.3%
Demonstrates superior performance on large-scale models
Abstract
Mixture of Experts (MoE) models are emerging as the latest paradigm for Large Language Models (LLMs). However, due to memory constraints, MoE models with billions or even trillions of parameters can only be deployed on multi-GPU or even multi-node, multi-GPU serving systems. Communication, especially inter-node communication, has therefore become a major bottleneck in distributed serving systems. Contemporary distributed MoE models are primarily implemented using all-reduce (AR) based tensor parallelism (TP) and all-to-all (A2A) based expert parallelism (EP). However, TP generally exhibits low inter-node efficiency and is thus confined to high-speed intra-node bandwidth. In contrast, EP tends to suffer from load imbalance, especially when the parallel degree is high. In this work, we introduce MixServe, a novel automatic distributed serving system for efficient deployment of MoE…
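As an illustration of the two collectives the abstract names, the following is a minimal single-process sketch of what all-reduce (used by TP) and all-to-all (used by EP) compute. It simulates ranks as plain Python lists; the function names `all_reduce_sum` and `all_to_all` and the toy data are assumptions made for this example, and it does not represent MixServe's fused communication algorithm or any real communication library.

```python
from typing import List


def all_reduce_sum(per_rank: List[List[float]]) -> List[List[float]]:
    """Simulated all-reduce: every rank ends with the element-wise sum
    of all ranks' vectors (the TP pattern for combining partial outputs)."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_to_all(per_rank: List[List[str]]) -> List[List[str]]:
    """Simulated all-to-all: rank `src` sends its chunk `dst` to rank `dst`,
    so rank `dst` receives one chunk from every rank (the EP pattern for
    routing tokens to the ranks that host their experts)."""
    n = len(per_rank)
    return [[per_rank[src][dst] for src in range(n)] for dst in range(n)]


# TP-style: two ranks each hold a partial result for the same two outputs.
partials = [[1.0, 2.0], [10.0, 20.0]]
reduced = all_reduce_sum(partials)

# EP-style: each rank has one chunk of tokens destined for each expert rank.
outbox = [["r0->e0", "r0->e1"], ["r1->e0", "r1->e1"]]
inbox = all_to_all(outbox)

print(reduced)  # every rank holds the same summed vector
print(inbox)    # rank i holds the chunks addressed to expert rank i
```

In this toy model, all-reduce leaves every rank with an identical full-size result, while all-to-all only moves each chunk to its destination rank. If routing sends most tokens to one expert, that rank's inbox grows while the others stay small, which is the load imbalance the abstract attributes to EP.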
Taxonomy
Topics: Tensor decomposition and applications · Big Data and Digital Economy · Topic Modeling
