RailS: Load Balancing for All-to-All Communication in Distributed Mixture-of-Experts Training
Heng Xu, Zhiwei Yu, Chengze Du, Ying Zhou, Letian Li, Haojie Wang, Weiqiang Cheng, Jialong Li

TL;DR
RailS is a load-balancing framework for distributed Mixture-of-Experts (MoE) training. It exploits the symmetry of the Rail topology and per-node local scheduling to relieve all-to-all communication bottlenecks, improving bus bandwidth by 20%–78%.
Contribution
RailS introduces a topology-aware load-balancing method built on local scheduling. It leverages the symmetry of the Rail architecture to achieve near-optimal load balance and substantial performance gains.
Findings
Improves bus bandwidth by 20%–78%.
Reduces iteration time by 17%–78%.
Achieves near-optimal load balance in MoE workloads.
Abstract
Training Mixture-of-Experts (MoE) models introduces sparse and highly imbalanced all-to-all communication that dominates iteration time. Conventional load-balancing methods fail to exploit the deterministic topology of Rail architectures, leaving multi-NIC bandwidth underutilized. We present RailS, a distributed load-balancing framework that minimizes all-to-all completion time in MoE training. RailS leverages the Rail topology's symmetry to prove that uniform sending ensures uniform receiving, transforming global coordination into local scheduling. Each node independently executes a Longest Processing Time First (LPT) spraying scheduler to proactively balance traffic using local information. RailS activates N parallel rails for fine-grained, topology-aware multipath transmission. Across synthetic and real-world MoE workloads, RailS improves bus bandwidth by 20%–78% and reduces…
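The abstract names a Longest Processing Time First (LPT) spraying scheduler that each node runs on local information. The sketch below illustrates only the classic greedy LPT rule: sort items by size, then place each on the currently least-loaded rail. It is not the authors' implementation. The function name `lpt_spray`, the flow identifiers, and the byte-count sizes are hypothetical, and the paper's actual scheduling granularity (for example, how messages are chunked before spraying) is not specified in this summary.

```python
import heapq


def lpt_spray(flow_sizes, num_rails):
    """Greedy LPT assignment of flows to rails (illustrative only).

    flow_sizes: dict mapping a hypothetical flow id to its size in bytes.
    num_rails:  number of parallel rails (N in the abstract).
    Returns (assignment, loads): assignment maps flow id -> rail index,
    and loads[r] is the total number of bytes placed on rail r.
    """
    # Min-heap of (current load, rail index); the least-loaded rail is on top.
    heap = [(0, rail) for rail in range(num_rails)]
    heapq.heapify(heap)

    assignment = {}
    loads = [0] * num_rails

    # LPT rule: visit flows from largest to smallest.
    for flow_id, size in sorted(flow_sizes.items(), key=lambda kv: -kv[1]):
        load, rail = heapq.heappop(heap)
        assignment[flow_id] = rail
        loads[rail] = load + size
        heapq.heappush(heap, (loads[rail], rail))

    return assignment, loads


# Hypothetical example: five outgoing flows sprayed over two rails.
flows = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment, loads = lpt_spray(flows, num_rails=2)
print(loads)  # prints [10, 10]
```

Because the rule needs only the sender's own flow sizes, it fits the abstract's claim that balancing can be done per node without global coordination. Whether the resulting receive-side balance holds depends on the symmetry argument made in the paper, which this sketch does not reproduce.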
Taxonomy
Topics: Age of Information Optimization · IoT and Edge/Fog Computing · Software-Defined Networks and 5G
