HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap

Wenxiang Lin; Xinglin Pan; Lin Zhang; Shaohuai Shi; Xuan Wang; Xiaowen Chu

arXiv:2508.09591·cs.DC·August 14, 2025

TL;DR

HierMoE introduces topology-aware token deduplication and expert swap techniques to accelerate MoE transformer training, reducing communication and balancing workloads across GPUs for large language models.

Contribution

The paper presents HierMoE, a novel system that improves MoE training efficiency through theoretical models and practical topology-aware optimizations.

Findings

01. Achieves up to 3.32x faster communication

02. Delivers up to 1.27x faster end-to-end training

03. Outperforms state-of-the-art MoE systems in experiments

Abstract

The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) because its sparsity reduces computational demands while allowing the model size to scale easily. In MoE models, each MoE layer must dynamically route tokens to particular experts for computation, and the activated experts may not reside on the same device or GPU as the token. This leads to substantial communication and load imbalance across all GPUs, which limits the scalability of distributed systems within a GPU cluster. To address this, we introduce HierMoE, which accelerates the training of MoE models with two topology-aware techniques: 1) token deduplication to reduce communication traffic, and 2) expert swap to balance workloads among all GPUs. To make these two approaches more general, we build…
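
The abstract is cut off before it explains how token deduplication works. The sketch below shows one plausible reading and is not the authors' implementation. It assumes that when a token selects several experts hosted on the same GPU, the token is sent to that GPU once rather than once per expert. The function name `count_transfers`, the expert placement, and the routing table are all hypothetical. The count also ignores which GPU each token starts on, so it is an upper bound on traffic, not a measurement.

```python
from typing import Dict, List


def count_transfers(
    topk_experts: List[List[int]],
    expert_to_gpu: Dict[int, int],
    dedup: bool,
) -> int:
    """Count token copies dispatched to GPUs for one MoE layer.

    topk_experts[t] lists the experts selected for token t.
    expert_to_gpu maps each expert id to the GPU that hosts it.
    """
    total = 0
    for experts in topk_experts:
        gpus = [expert_to_gpu[e] for e in experts]
        # Assumed deduplication rule: a token bound for several experts
        # on the same GPU is sent to that GPU only once.
        total += len(set(gpus)) if dedup else len(gpus)
    return total


# Hypothetical layout: 4 experts, 2 per GPU, top-2 routing for 3 tokens.
expert_to_gpu = {0: 0, 1: 0, 2: 1, 3: 1}
topk_experts = [[0, 1], [0, 2], [2, 3]]

naive = count_transfers(topk_experts, expert_to_gpu, dedup=False)
deduped = count_transfers(topk_experts, expert_to_gpu, dedup=True)
print(naive, deduped)
```

In this toy routing, the first and third tokens each pick two experts on a single GPU, so only the second token still needs two copies. How the paper applies this across the levels of the GPU topology is not recoverable from the truncated abstract.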
