HEXA-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts
Shuqing Luo, Jie Peng, Pingzhi Li, Hanrui Wang, and Tianlong Chen

TL;DR
HEXA-MoE introduces a heterogeneous-aware training framework for Mixture-of-Experts models, significantly improving efficiency by reducing memory use and speeding up training on diverse hardware.
Contribution
It proposes expert-specific operators and adaptive configurations to enhance MoE training efficiency on heterogeneous devices.
Findings
Reduces memory consumption by 10-48%.
Achieves 0.5-4.3x speedup over state-of-the-art MoE libraries.
Effectively minimizes latency on heterogeneous hardware.
Abstract
Mixture-of-Experts (MoE) has emerged as a practical approach to scale up parameters for the Transformer model to achieve better generalization while maintaining a sub-linear increase in computation overhead. Current MoE models are mainly built with expert parallelism on distributed devices. However, this approach usually depends on homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we explore developing a Heterogeneous-aware EXpert Allocation framework, HEXA-MoE, with significantly enhanced computing efficiency. It contains two components: (1) Expert-Specific Operators. We replace the typical general matrix multiplication or grouped matrix multiplication interfaces with our operators, which allows the computing to be performed in an in-place manner with ZERO…
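As a rough illustration of the contrast the abstract draws, the sketch below compares a conventional dispatch-and-combine MoE layer (tokens regrouped per expert before a per-expert matrix multiplication) with a per-token, expert-indexed computation that writes each result directly to its output slot. This is a toy reading of "in-place" under stated assumptions (top-1 routing, a single linear layer per expert, pure Python lists); it is not the HEXA-MoE operators, and every function and variable name here is hypothetical.

```python
# Illustrative sketch only, not the HEXA-MoE kernels. All names are hypothetical.

def matvec(weight, x):
    """Multiply a (d_out x d_in) weight matrix by a length-d_in vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def moe_dispatch_combine(tokens, routes, experts):
    """Baseline pattern: gather tokens per expert, compute, scatter back."""
    out = [None] * len(tokens)
    for e, weight in enumerate(experts):
        idx = [i for i, r in enumerate(routes) if r == e]  # dispatch
        batch = [tokens[i] for i in idx]                    # regrouped token list
        results = [matvec(weight, x) for x in batch]        # per-expert matmul
        for i, y in zip(idx, results):                      # combine
            out[i] = y
    return out


def moe_expert_specific(tokens, routes, experts):
    """Assumed expert-specific pattern: each token indexes its expert's
    weights directly and the result lands in its own output slot."""
    out = [None] * len(tokens)
    for i, (x, e) in enumerate(zip(tokens, routes)):
        out[i] = matvec(experts[e], x)
    return out


tokens = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
routes = [1, 0, 1]                # top-1 expert index per token (assumed routing)
experts = [
    [[1.0, 0.0], [0.0, 1.0]],     # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],     # expert 1: scale by 2
]

# Both patterns produce the same outputs; only the data movement differs.
assert moe_dispatch_combine(tokens, routes, experts) == \
    moe_expert_specific(tokens, routes, experts)
print(moe_expert_specific(tokens, routes, experts))
# [[2.0, 4.0], [0.5, -1.0], [6.0, 0.0]]
```

The second function never builds per-expert token batches, which is the kind of regrouping step that dispatch-based libraries perform. How the actual operators achieve this on GPU is not recoverable from the truncated abstract.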
Taxonomy
Topics: IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems
