Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
Shuqing Luo, Pingzhi Li, Jie Peng, Hanrui Wang, Yang (Katie) Zhao, Yu (Kevin) Cao, Yu Cheng, Tianlong Chen

TL;DR
Occult introduces system- and algorithm-level innovations to reduce communication overhead in Mixture-of-Experts models, significantly accelerating training and inference while maintaining model quality.
Contribution
The paper defines collaborative communication between co-activated experts and proposes system- and algorithm-level methods that reduce all-to-all communication costs in expert-parallel MoE training and inference.
Findings
Achieves over 1.5x speedup in training and inference.
Maintains comparable or better model quality.
Effective communication reduction through collaboration pruning.
Abstract
Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, this communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, we call a pair of experts co-activated by one token "collaborated"; the collaboration is either intra- or inter-collaboration, depending on whether the two experts are kept on the same device. Our pilot investigations reveal that augmenting the proportion of…
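The definition of intra- and inter-collaboration in the abstract can be illustrated with a small counting sketch. It is not the paper's implementation: the function names, the toy routing table, and the expert-to-device placement are all hypothetical and exist only to show how the two cases are distinguished.

```python
from itertools import combinations


def collaboration_counts(routing, expert_to_device):
    """Count co-activated expert pairs that are intra- vs inter-device.

    routing: iterable of per-token expert id tuples (hypothetical top-k output).
    expert_to_device: dict mapping expert id -> device id (hypothetical placement).
    """
    intra = inter = 0
    for experts in routing:
        # Every unordered pair of distinct experts chosen by one token collaborates.
        for a, b in combinations(sorted(set(experts)), 2):
            if expert_to_device[a] == expert_to_device[b]:
                intra += 1  # both experts live on the same device
            else:
                inter += 1  # the pair spans two devices
    return intra, inter


def intra_ratio(routing, expert_to_device):
    """Fraction of collaborating pairs that stay on a single device."""
    intra, inter = collaboration_counts(routing, expert_to_device)
    total = intra + inter
    return intra / total if total else 0.0


# Toy setup (assumed): 4 experts, 2 devices, top-2 routing for 4 tokens.
routing = [(0, 1), (0, 1), (2, 3), (0, 2)]
placement = {0: 0, 1: 0, 2: 1, 3: 1}

counts = collaboration_counts(routing, placement)
ratio = intra_ratio(routing, placement)
```

In this toy case three of the four co-activated pairs share a device, so only the token routed to experts 0 and 2 involves an inter-device pair under this placement.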
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Mixture of Experts · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
