Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts

Shuqing Luo, Jie Peng, Pingzhi Li, Hanrui Wang, and Tianlong Chen

arXiv:2411.01288 · cs.DC · April 3, 2025

TL;DR

HEXA-MoE is a heterogeneous-aware training framework for Mixture-of-Experts models. It reduces memory consumption by 10-48% and achieves a 0.5-4.3x speedup over state-of-the-art MoE libraries, and it targets clusters built from diverse hardware.

Contribution

The paper proposes expert-specific operators and adaptive configurations that make MoE training more efficient on heterogeneous devices.

Findings

01. Reduces memory consumption by 10-48%.
02. Achieves 0.5-4.3x speedup over state-of-the-art MoE libraries.
03. Effectively minimizes latency on heterogeneous hardware.

Abstract

Mixture-of-Experts (MoE) has emerged as a practical approach to scale up parameters for the Transformer model to achieve better generalization while maintaining a sub-linear increase in computation overhead. Current MoE models are mainly built with expert parallelism on distributed devices. However, expert parallelism usually depends on homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we explore developing a Heterogeneous-aware EXpert Allocation framework, HEXA-MoE, with significantly enhanced computing efficiency. It contains two components: (1) Expert-Specific Operators. We replace the typical general matrix multiplication or grouped matrix multiplication interfaces with our operators, which allows the computing to be performed in an in-place manner with ZERO…
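The abstract is cut off before it describes the expert-specific operators in detail, so the following is only a loose illustration of the general idea of computing each token's expert output where the token already sits, rather than first copying tokens into per-expert batches. It is a minimal pure-Python sketch under stated assumptions: top-1 routing, one weight matrix per expert, and hypothetical helper names (`matvec`, `route_top1`, `moe_forward_direct`, `moe_forward_dispatch`) that do not come from the paper or its repository.

```python
import random

def matvec(w, x):
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def route_top1(gate, x):
    """Return the index of the expert with the highest gate score for token x."""
    scores = matvec(gate, x)
    return max(range(len(scores)), key=scores.__getitem__)

def moe_forward_direct(tokens, gate, experts):
    """Top-1 MoE forward without per-expert staging buffers (illustrative only).

    Each token is read at its own index and its output is written to the
    same index, so no dispatch or combine copies are made.
    """
    out = [None] * len(tokens)
    for i, x in enumerate(tokens):
        out[i] = matvec(experts[route_top1(gate, x)], x)
    return out

def moe_forward_dispatch(tokens, gate, experts):
    """Reference path: group tokens per expert, compute, then scatter back."""
    buckets = {e: [] for e in range(len(experts))}
    for i, x in enumerate(tokens):
        buckets[route_top1(gate, x)].append(i)
    out = [None] * len(tokens)
    for e, idxs in buckets.items():
        batch = [tokens[i] for i in idxs]                 # dispatch copy
        results = [matvec(experts[e], x) for x in batch]  # per-expert compute
        for i, y in zip(idxs, results):                   # combine step
            out[i] = y
    return out

# Hypothetical toy sizes: 6 tokens, hidden size 4, 3 experts.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

tokens = rand_matrix(6, 4)
gate = rand_matrix(3, 4)
experts = [rand_matrix(4, 4) for _ in range(3)]

a = moe_forward_direct(tokens, gate, experts)
b = moe_forward_dispatch(tokens, gate, experts)
assert a == b  # both paths apply the same expert to each token
```

Both functions produce the same outputs; the difference is only that the second one materializes a per-expert `batch` list before computing. Whether and how HEXA-MoE avoids such intermediate copies on real accelerators is described in the paper itself, not in this sketch.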

Code & Models

Repositories

unites-lab/hexa-moe (PyTorch, official)

Taxonomy

Topics: IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems