MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm

Bowen Zhou; Jinrui Jia; Wenhao He; Yong Zhang; Fang Dong

arXiv:2601.08800·cs.DC·January 14, 2026

TL;DR

MixServe is an automatic distributed serving system for MoE models that selects and optimizes hybrid parallelism strategies, reporting a 1.08x to 3.80x speedup in time to first token and up to 50.3% higher throughput on large-scale language models.

Contribution

The paper introduces a novel TP-EP hybrid parallelism with fused communication, and an automatic strategy selection mechanism for efficient MoE model deployment.

Findings

01. Achieves a 1.08x to 3.80x speedup in time to first token (TTFT)

02. Improves throughput by up to 50.3%

03. Demonstrates superior performance on large-scale models

Abstract

Mixture of Experts (MoE) models are emerging as the latest paradigm for Large Language Models (LLMs). However, due to memory constraints, MoE models with billions or even trillions of parameters can only be deployed on multi-GPU, or even multi-node multi-GPU, serving systems. As a result, communication, especially inter-node communication, has become a major bottleneck in distributed serving systems. Contemporary distributed MoE models are primarily implemented using all-reduce (AR) based tensor parallelism (TP) and all-to-all (A2A) based expert parallelism (EP). However, TP generally exhibits low inter-node efficiency and is thus confined to high-bandwidth intra-node links. In contrast, EP tends to suffer from load imbalance, especially when the degree of parallelism is high. In this work, we introduce MixServe, a novel automatic distributed serving system for efficient deployment of MoE…
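The abstract's remark that EP load imbalance worsens at high parallel degree can be illustrated with a toy simulation. The sketch below is not MixServe's method. It assumes a synthetic Zipf-like expert popularity and a contiguous expert-to-rank placement, and the function name `ep_imbalance` and all of its parameters are hypothetical. It reports the ratio of the busiest rank's token load to the mean load, where 1.0 means perfectly balanced.

```python
import random


def ep_imbalance(num_experts, ep_degree, num_assignments, skew=1.0, seed=0):
    """Return max/mean token load across EP ranks for a synthetic routing.

    Assumptions (illustrative only): expert popularity follows a Zipf-like
    law, and experts are placed on ranks in contiguous equal-sized blocks.
    """
    if num_experts % ep_degree != 0:
        raise ValueError("ep_degree must divide num_experts")
    rng = random.Random(seed)

    # Hypothetical skewed popularity: weights proportional to 1 / rank^skew.
    weights = [1.0 / (i + 1) ** skew for i in range(num_experts)]
    rng.shuffle(weights)  # do not cluster the popular experts on one rank

    # Route each token-to-expert assignment independently by popularity.
    chosen = rng.choices(range(num_experts), weights=weights, k=num_assignments)
    per_expert = [0] * num_experts
    for e in chosen:
        per_expert[e] += 1

    # Sum expert loads per EP rank (contiguous blocks of experts).
    block = num_experts // ep_degree
    loads = [sum(per_expert[r * block:(r + 1) * block]) for r in range(ep_degree)]

    mean_load = sum(loads) / len(loads)
    return max(loads) / mean_load


if __name__ == "__main__":
    for degree in (1, 2, 4, 8, 16):
        ratio = ep_imbalance(num_experts=64, ep_degree=degree, num_assignments=10_000)
        print(f"EP degree {degree:2d}: max/mean load = {ratio:.2f}")
```

With the same routing, each doubling of the EP degree splits every rank's expert block in two, so the busiest rank's share relative to the mean cannot shrink. Since the slowest rank gates an all-to-all step, a higher ratio means more idle time on the other ranks.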

Taxonomy

Topics: Tensor decomposition and applications · Big Data and Digital Economy · Topic Modeling