BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference

Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, and Cheng Li

arXiv:2502.16927 · cs.LG · March 10, 2025

TL;DR

BigMac is a communication-efficient fine-grained MoE structure that reduces training latency by up to 3.09 times and increases inference throughput by up to 3.11 times while matching or improving the model quality of existing fine-grained MoEs.

Contribution

It proposes DCCA (descend-communicate-communicate-ascend), a communication scheme for fine-grained MoE that replaces the communicate-descend-ascend-communicate (CDAC) manner, lowering communication overhead and speeding up both training and inference.

Findings

01. Achieves model quality comparable to or better than existing fine-grained MoEs.

02. Reduces training latency by up to 3.09 times.

03. Increases inference throughput by up to 3.11 times.

Abstract

The Mixture-of-Experts (MoE) structure scales Transformer-based large language models (LLMs) and improves their performance with only a sub-linear increase in computation resources. Recently, the fine-grained DeepSeekMoE structure was proposed, which further improves the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts and therefore incurs heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but has high communication efficiency. The key innovation of BigMac is that we abandon the communicate-descend-ascend-communicate (CDAC) manner used by fine-grained MoE, which leads to the…
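To make the communication argument concrete, the sketch below counts how many elements one token contributes to the All-to-All traffic of a single MoE layer, depending on the dimension at which communication happens. It is a back-of-the-envelope illustration, not the paper's implementation: the function name, `hidden_dim`, `reduced_dim`, and `top_k` are hypothetical, and the reduced-dimension case assumes that tokens are projected down before being communicated.

```python
def all_to_all_elements_per_token(comm_dim: int, top_k: int) -> int:
    """Elements one token contributes to All-to-All traffic in one MoE layer.

    Each token is routed to `top_k` experts. Every routed copy is sent once
    (dispatch) and its result is returned once (combine), so the payload is
    2 * top_k vectors of size `comm_dim`.
    """
    return 2 * top_k * comm_dim


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
hidden_dim = 4096   # full model hidden size
reduced_dim = 512   # assumed size after a down-projection
top_k = 8           # assumed number of experts activated per token

# Communicating at the full hidden dimension.
full_dim_traffic = all_to_all_elements_per_token(hidden_dim, top_k)

# Communicating at the reduced dimension (assumption: descend happens first).
reduced_dim_traffic = all_to_all_elements_per_token(reduced_dim, top_k)

ratio = full_dim_traffic / reduced_dim_traffic
print(full_dim_traffic, reduced_dim_traffic, ratio)
```

Under these assumed sizes the payload shrinks by the ratio `hidden_dim / reduced_dim`. This simple count ignores routing metadata, padding, and any overlap of communication with computation, so it should not be read as a prediction of the reported latency or throughput gains.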

Taxonomy

Methods: Mixture of Experts