TL;DR
BigMac introduces a communication-efficient fine-grained MoE structure that significantly reduces training and inference latency while maintaining or improving model performance.
Contribution
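It proposes a descend-communicate-communicate-ascend (DCCA) communication scheme for fine-grained MoE, replacing the communicate-descend-ascend-communicate (CDAC) scheme used by existing fine-grained MoEs, which lowers communication overhead and speeds up training and inference.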
Findings
Achieves comparable or better model quality than existing fine-grained MoEs.
Reduces training latency by up to 3.09 times.
Increases inference throughput by up to 3.11 times.
Abstract
The Mixture-of-Experts (MoE) structure scales Transformer-based large language models (LLMs) and improves their performance with only a sub-linear increase in computation resources. Recently, a fine-grained DeepSeekMoE structure was proposed, which further improves the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts and therefore incurs heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but has high communication efficiency. The main innovation of BigMac is that we abandon the communicate-descend-ascend-communicate (CDAC) manner used by fine-grained MoE, which leads to the…
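To give a rough sense of why the order of projection and communication matters, the sketch below estimates the All-to-All payload for one MoE layer when tokens are exchanged at the full hidden dimension versus at a reduced dimension. It is a back-of-the-envelope model, not the paper's implementation. It reads the acronyms literally, so that DCCA projects activations down before the dispatch and back up after the combine; that reading is an assumption, since the abstract excerpt above is truncated. The function name `all_to_all_bytes` and every size (`tokens`, `top_k`, `d_model`, `d_low`) are hypothetical values chosen for illustration.

```python
def all_to_all_bytes(tokens: int, top_k: int, dim: int, bytes_per_elem: int = 2) -> int:
    """Upper-bound payload of one dispatch plus one combine All-to-All.

    Assumes every routed token copy crosses a device boundary and that
    activations use `bytes_per_elem` bytes per element (2 for fp16/bf16).
    """
    one_way = tokens * top_k * dim * bytes_per_elem
    return 2 * one_way  # dispatch to the experts + combine back


# Hypothetical sizes, not taken from the paper.
tokens = 4096    # tokens per device per step
top_k = 8        # experts activated per token (fine-grained MoE activates many)
d_model = 4096   # hidden dimension of the Transformer
d_low = 512      # assumed reduced dimension after the descending projection

cdac = all_to_all_bytes(tokens, top_k, d_model)  # communicate at d_model
dcca = all_to_all_bytes(tokens, top_k, d_low)    # communicate at d_low

print(f"CDAC-style payload: {cdac / 2**20:.0f} MiB")
print(f"DCCA-style payload: {dcca / 2**20:.0f} MiB")
print(f"Reduction factor:   {cdac / dcca:.1f}x")
```

Under these assumptions the payload shrinks by the ratio `d_model / d_low`. The end-to-end speedups reported in the Findings depend on more than payload size (compute, overlap, network topology), so this ratio should not be read as a latency prediction.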
Taxonomy
Methods: Mixture of Experts
