MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

Zheng Zhang; Donglin Yang; Yaqi Xia; Liang Ding; Dacheng Tao; Xiaobo Zhou; Dazhao Cheng

arXiv:2506.22175·cs.DC·June 30, 2025

TL;DR

MPipeMoE introduces adaptive, memory-efficient pipeline parallelism for MoE training, reporting up to 2.8x speedup and up to 47% lower memory footprint on large pre-trained models.

Contribution

The paper proposes MPipeMoE, a novel library that enhances MoE training efficiency through adaptive pipeline parallelism and memory reuse strategies, addressing communication and memory challenges.

Findings

01. Achieves up to 2.8x speedup over existing methods.

02. Reduces memory footprint by up to 47%.

03. Effective on large-scale MoE models in a multi-node cluster.

Abstract

Recently, Mixture-of-Experts (MoE) has become one of the most popular techniques to scale pre-trained models to extraordinarily large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite existing system and algorithm optimizations, significant challenges remain in communication inefficiency and memory consumption. In this paper, we present the design and implementation of MPipeMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Motivated by the observation that the MoE training procedure can be divided into multiple independent sub-stages, we design adaptive pipeline parallelism with an online…
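
To make the sub-stage idea concrete, here is a minimal toy cost model, not the paper's algorithm. It assumes three hypothetical sub-stages (dispatch communication, expert computation, combine communication) that can each overlap with the others across chunks, and that splitting a batch into `n_chunks` divides every stage time evenly. The function name `pipeline_makespan` and the stage times are illustrative assumptions, not values or APIs from MPipeMoE.

```python
def pipeline_makespan(stage_times, n_chunks):
    """Total time to push n_chunks through a simple in-order pipeline.

    stage_times: time each stage would take on the WHOLE batch
                 (assumption: cost divides evenly across chunks).
    n_chunks:    number of equal chunks the batch is split into.
    """
    per_chunk = [t / n_chunks for t in stage_times]
    # finish[s] = time at which stage s finished the most recent chunk
    finish = [0.0] * len(per_chunk)
    for _ in range(n_chunks):
        prev_stage_done = 0.0  # when this chunk left the previous stage
        for s, t in enumerate(per_chunk):
            # A chunk starts stage s once the stage is free AND the chunk
            # has finished the preceding stage.
            start = max(finish[s], prev_stage_done)
            finish[s] = start + t
            prev_stage_done = finish[s]
    return finish[-1]


# Hypothetical whole-batch times: dispatch, expert compute, combine.
stages = [4.0, 8.0, 4.0]
no_pipeline = pipeline_makespan(stages, 1)   # stages run back to back
pipelined = pipeline_makespan(stages, 4)     # four chunks overlap
print(no_pipeline, pipelined)
```

In this idealized model, finer chunks always help. Real systems pay a per-chunk overhead (kernel launches, smaller and less efficient messages), so the best chunk count depends on the workload, which is one plausible reason to choose the granularity adaptively.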
