HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu

TL;DR
HierMoE introduces topology-aware token deduplication and expert swap techniques to accelerate MoE transformer training, reducing communication and balancing workloads across GPUs for large language models.
Contribution
The paper presents HierMoE, a novel system that improves MoE training efficiency through theoretical models and practical topology-aware optimizations.
Findings
Achieves up to 3.32x faster communication
Delivers up to 1.27x faster end-to-end training
Outperforms state-of-the-art MoE systems in experiments
Abstract
The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) because its sparsity lowers computational demands while letting the model size scale easily. In MoE models, each MoE layer must dynamically route tokens to particular experts for computation, and the activated experts may not be located on the same device or GPU as the token. This leads to substantial communication and load imbalance across GPUs, which limits the scalability of distributed systems within a GPU cluster. To this end, we introduce HierMoE to accelerate the training of MoE models with two topology-aware techniques: 1) token deduplication to reduce the communication traffic, and 2) expert swap to balance the workloads among all GPUs. To make the two proposed approaches more general, we build…
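To illustrate the intuition behind token deduplication, the sketch below counts how many token copies would be sent under top-k routing, with and without merging copies bound for the same destination group. It is a toy model and not HierMoE's actual algorithm: the function name `count_transfers`, the contiguous expert placement, and the example routing table are all assumptions made for illustration, and the sketch ignores where each token originates.

```python
from typing import List


def count_transfers(topk_experts: List[List[int]],
                    experts_per_group: int,
                    dedup: bool) -> int:
    """Count token copies sent to expert groups (hypothetical helper).

    Assumes experts are placed contiguously, so expert e lives in
    group e // experts_per_group (a group can be a GPU or a node).
    """
    total = 0
    for experts in topk_experts:
        groups = [e // experts_per_group for e in experts]
        # With deduplication, a token is sent once per distinct group
        # instead of once per selected expert.
        total += len(set(groups)) if dedup else len(groups)
    return total


# Toy routing table: 3 tokens, top-2 routing over 8 experts.
routing = [[0, 1], [2, 5], [4, 7]]

naive = count_transfers(routing, experts_per_group=2, dedup=False)
gpu_dedup = count_transfers(routing, experts_per_group=2, dedup=True)   # 2 experts per GPU
node_dedup = count_transfers(routing, experts_per_group=4, dedup=True)  # 4 experts per node

print(naive, gpu_dedup, node_dedup)
```

In this toy setup, deduplicating at the coarser node level removes more copies than deduplicating at the GPU level, which is the motivation for making deduplication aware of the cluster topology.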
