FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

Shuyao Qi; Haoyuan Liu; Shizhen Zhao

arXiv:2604.19654·cs.DC·April 22, 2026


TL;DR

FEPLB leverages NVIDIA Hopper's NVLink Copy Engine to enable nearly free intra-node load balancing in MoE training, significantly reducing stragglers without extra communication overhead.

Contribution

The paper introduces FEPLB, a load-balancing method that exploits the NVLink Copy Engine for efficient intra-node token redistribution in distributed MoE training.

Findings

01. Reduces token straggler by 51-70% on 16 H100 GPUs.

02. Achieves 2x lower token straggler than FasterMoE at EP=8.

03. No measurable communication overhead introduced by FEPLB.

Abstract

Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide, especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture, the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel that runs in parallel with compute kernels. FEPLB turns this idle hardware into a new parallel dimension for MoE load rebalancing. Its Two-Phase Dispatch first routes tokens across nodes via the standard EP backend, then redistributes dynamic-expert tokens and weights within the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler runs concurrently with…
