FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
Shuyao Qi, Haoyuan Liu, Shizhen Zhao

TL;DR
FEPLB leverages NVIDIA Hopper's NVLink Copy Engine to enable nearly free intra-node load balancing in MoE training, significantly reducing stragglers without extra communication overhead.
Contribution
The paper introduces FEPLB, a novel load balancing method that exploits the NVLink Copy Engine for efficient intra-node token redistribution in distributed MoE training.
Findings
Reduces token straggler by 51-70% on 16 H100 GPUs.
Achieves 2x lower token straggler than FasterMoE at EP=8.
No measurable communication overhead introduced by FEPLB.
Abstract
Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide, especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel that runs in parallel with compute kernels. FEPLB turns this idle hardware into a new parallel dimension for MoE load rebalancing. Its Two-Phase Dispatch first routes tokens across nodes via the standard EP backend, then redistributes dynamic-expert tokens and weights within the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler runs concurrently with…
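To make the intra-node rebalancing idea concrete, the sketch below shows one way a CPU-side scheduler could assign dynamic experts to the GPUs of a single NVLink domain. The abstract does not describe the actual scheduling algorithm, so everything here is an assumption for illustration: the function names `rebalance_dynamic_experts` and `max_load`, the split into a fixed per-GPU load plus movable dynamic experts, and the greedy longest-processing-time heuristic are all hypothetical stand-ins, not FEPLB's method. The sketch models only the placement decision, not the Copy Engine transfers.

```python
from typing import Dict, List


def rebalance_dynamic_experts(
    static_load: List[int],
    dynamic_tokens: Dict[int, int],
) -> Dict[int, int]:
    """Assign each dynamic expert to one GPU in the node.

    Hypothetical greedy heuristic (longest processing time first), used
    here only as an illustration; it is NOT the scheduler from the paper.

    static_load[g]     -- tokens already pinned to GPU g (assumed fixed)
    dynamic_tokens[e]  -- tokens routed to movable expert e this micro-batch
    """
    load = list(static_load)
    placement: Dict[int, int] = {}
    # Place the heaviest experts first; break ties by expert id.
    ordered = sorted(dynamic_tokens.items(), key=lambda kv: (-kv[1], kv[0]))
    for expert, tokens in ordered:
        # Pick the currently least-loaded GPU; break ties by GPU index.
        gpu = min(range(len(load)), key=lambda g: (load[g], g))
        placement[expert] = gpu
        load[gpu] += tokens
    return placement


def max_load(
    static_load: List[int],
    dynamic_tokens: Dict[int, int],
    placement: Dict[int, int],
) -> int:
    """Token count on the busiest GPU, i.e. the straggler for this step."""
    load = list(static_load)
    for expert, gpu in placement.items():
        load[gpu] += dynamic_tokens[expert]
    return max(load)


if __name__ == "__main__":
    static = [100, 40, 60, 80]             # made-up per-GPU token counts
    dynamic = {0: 50, 1: 30, 2: 20, 3: 10}  # made-up dynamic-expert loads
    naive = {e: e for e in dynamic}         # expert i stays on GPU i
    balanced = rebalance_dynamic_experts(static, dynamic)
    print("naive max load:   ", max_load(static, dynamic, naive))
    print("balanced max load:", max_load(static, dynamic, balanced))
```

With these made-up numbers, leaving expert i on GPU i puts 150 tokens on the busiest GPU, while the greedy placement brings that down to 100. In a real system the placement would have to be computed quickly enough to overlap with compute, which is presumably why the abstract describes the scheduler as lightweight and concurrent.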
