A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
X.Y. Han, Yuan Zhong

TL;DR
This paper develops a theoretical framework for analyzing an auxiliary-loss-free load balancing method in sparse mixture-of-experts models, providing insights into its structural properties and online optimization performance.
Contribution
The paper casts Auxiliary-Loss-Free Load Balancing (ALF-LB) as a primal-dual method, derives structural properties and regret bounds from that view, and supports them with experiments on large-scale models.
Findings
Monotonic improvement condition for the Lagrangian objective
Preference rule for balancing expert load
Logarithmic expected regret bound in online setting
Abstract
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of costly GPUs and for the thorough training of architecture parameters across all experts. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a primal-dual method using a single-shot, constant-time update per training iteration for solving an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement condition for the Lagrangian objective, (ii) a preference rule that moves…
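To make the "single-shot update per training iteration" concrete, here is a minimal sketch of a bias-based balancing loop in plain Python. It assumes the sign-based bias update commonly attributed to Wang et al. (2024): each expert carries a bias that is added to its routing score for top-k selection only, and after each batch the bias is nudged up for underloaded experts and down for overloaded ones. The function names, the synthetic score model, the `skew` vector, and the step size are all hypothetical choices made for illustration; the paper's exact formulation and notation may differ.

```python
import random


def route_top_k(scores, bias, k):
    """For each token, pick the k experts with the largest (score + bias)."""
    n_experts = len(bias)
    return [
        sorted(range(n_experts), key=lambda e: row[e] + bias[e], reverse=True)[:k]
        for row in scores
    ]


def expert_loads(assignments, n_experts):
    """Count how many tokens were routed to each expert."""
    loads = [0] * n_experts
    for chosen in assignments:
        for e in chosen:
            loads[e] += 1
    return loads


def update_bias(bias, loads, step):
    """One update per iteration (assumed sign rule): raise the bias of
    underloaded experts, lower the bias of overloaded ones."""
    mean_load = sum(loads) / len(loads)

    def sign(x):
        return (x > 0) - (x < 0)

    return [b + step * sign(mean_load - c) for b, c in zip(bias, loads)]


def simulate(n_tokens=256, n_experts=8, k=2, step=0.01, n_iters=200, seed=0):
    """Run the loop on synthetic scores and track max-minus-min load."""
    rng = random.Random(seed)
    # Hypothetical built-in preference: later experts get higher raw scores.
    skew = [0.1 * e for e in range(n_experts)]
    bias = [0.0] * n_experts
    history = []
    for _ in range(n_iters):
        scores = [
            [rng.random() + skew[e] for e in range(n_experts)]
            for _ in range(n_tokens)
        ]
        loads = expert_loads(route_top_k(scores, bias, k), n_experts)
        history.append(max(loads) - min(loads))
        bias = update_bias(bias, loads, step)
    return bias, history


if __name__ == "__main__":
    final_bias, imbalance = simulate()
    print("imbalance, first 5 iterations:", imbalance[:5])
    print("imbalance, last 5 iterations: ", imbalance[-5:])
    print("final biases:", [round(b, 2) for b in final_bias])
```

In this toy setup the biases act like dual variables for the per-expert load constraints: the update needs only the observed load counts from the current batch, not a gradient of any auxiliary loss. The synthetic skew is there only so that routing starts out visibly unbalanced.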
