Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference

Shuqing Luo; Pingzhi Li; Jie Peng; Hanrui Wang; Yang (Katie) Zhao; Yu (Kevin) Cao; Yu Cheng; Tianlong Chen

arXiv:2505.13345·cs.LG·May 20, 2025

TL;DR

Occult introduces system- and algorithm-level innovations to reduce communication overhead in Mixture-of-Experts models, significantly accelerating training and inference while maintaining model quality.

Contribution

The paper proposes novel methods to optimize collaborative communication in MoE models, enabling faster training and inference with reduced communication costs.

Findings

01. Achieves over 1.5x speedup in training and inference.

02. Maintains comparable or better model quality.

03. Reduces communication effectively through collaboration pruning.

Abstract

Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, this communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (it consumes over $40\%$ of runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, we call a pair of experts co-activated by one token "collaborated"; collaboration comes in two cases, intra- and inter-collaboration, depending on whether the two experts are kept on the same device. Our pilot investigations reveal that augmenting the proportion of…
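
The intra-/inter-collaboration distinction defined in the abstract can be made concrete with a small counting sketch. This is an illustration only, not the paper's implementation: the function names `collaboration_counts` and `intra_ratio`, the toy routing table, and the expert-to-device placement are all hypothetical, and the sketch assumes each token's activated experts are given as a plain list of expert ids.

```python
from collections import Counter
from itertools import combinations


def collaboration_counts(token_experts, expert_to_device):
    """Count intra- and inter-device collaborations (illustrative helper).

    token_experts: list of per-token lists of activated expert ids.
    expert_to_device: dict mapping expert id -> device id (assumed placement).
    """
    counts = Counter(intra=0, inter=0)
    for experts in token_experts:
        # Every unordered pair of experts co-activated by one token
        # counts as one collaboration.
        for a, b in combinations(sorted(set(experts)), 2):
            if expert_to_device[a] == expert_to_device[b]:
                counts["intra"] += 1  # both experts live on the same device
            else:
                counts["inter"] += 1  # the pair spans two devices
    return counts


def intra_ratio(counts):
    """Fraction of collaborations that stay on a single device."""
    total = counts["intra"] + counts["inter"]
    return counts["intra"] / total if total else 0.0


# Toy example: 4 tokens with top-2 routing, 4 experts on 2 devices.
token_experts = [[0, 1], [0, 2], [1, 3], [2, 3]]
placement = {0: 0, 1: 0, 2: 1, 3: 1}
counts = collaboration_counts(token_experts, placement)
print(counts["intra"], counts["inter"], intra_ratio(counts))
```

In this toy setup, the pairs (0, 1) and (2, 3) are intra-collaborations and the pairs (0, 2) and (1, 3) are inter-collaborations, so half of the collaborations stay on one device.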

Code & Models

Repositories

unites-lab/occult (PyTorch, official)

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Mixture of Experts · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings