A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

X.Y. Han; Yuan Zhong

arXiv:2512.03915 · math.OC · April 28, 2026

TL;DR

This paper develops a theoretical framework for Auxiliary-Loss-Free Load Balancing (ALF-LB) in sparse Mixture-of-Experts models. It casts the procedure as a primal-dual method for an assignment problem, then analyzes its structural properties and its regret in an online setting.

Contribution

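The paper gives a primal-dual perspective on ALF-LB. From it, the authors derive structural properties (a monotonic improvement condition and a preference rule) and a logarithmic expected regret bound, supported by experiments on large-scale models.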

Findings

01. Monotonic improvement condition for the Lagrangian objective

02. Preference rule for balancing expert load

03. Logarithmic expected regret bound in online setting

Abstract

In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of costly GPUs and for the thorough training of architecture parameters across all experts. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a primal-dual method using a single-shot, constant-time update per training iteration for solving an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement condition for the Lagrangian objective, (ii) a preference rule that moves…
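To make the "single-shot, constant-time update per training iteration" concrete, here is a minimal Python sketch of a bias-based balancing step in the spirit of ALF-LB. It assumes the commonly described form of the procedure: a per-expert bias is added to the routing scores for top-k selection only, and is then nudged by a fixed step toward under-loaded experts. The function names (`route_top_k`, `alf_lb_step`), the `step_size` value, and the synthetic, skewed Gaussian scores are illustrative assumptions and do not come from the paper. The exact variant the paper analyzes may differ.

```python
import numpy as np

def route_top_k(scores, bias, k):
    """Select k experts per token using biased scores (the bias affects selection only)."""
    biased = scores + bias  # shape: (n_tokens, n_experts)
    # argpartition on the negated scores places the k largest entries first in each row.
    return np.argpartition(-biased, k - 1, axis=1)[:, :k]

def alf_lb_step(scores, bias, k, step_size):
    """One balancing step: route, count per-expert load, nudge each bias once."""
    chosen = route_top_k(scores, bias, k)
    n_experts = scores.shape[1]
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = chosen.size / n_experts  # perfectly balanced load per expert
    # Assumed sign-based update: raise the bias of under-loaded experts, lower it for over-loaded ones.
    new_bias = bias + step_size * np.sign(target - load)
    return new_bias, load

# Hypothetical setup: synthetic scores with a built-in preference for higher-indexed experts.
rng = np.random.default_rng(0)
n_tokens, n_experts, k = 512, 8, 2
skew = np.linspace(0.0, 1.0, n_experts)
bias = np.zeros(n_experts)
loads = []
for _ in range(200):
    scores = rng.normal(size=(n_tokens, n_experts)) + skew
    bias, load = alf_lb_step(scores, bias, k, step_size=0.01)
    loads.append(load)

first_load, last_load = loads[0], loads[-1]
print("load spread at step 0:  ", first_load.max() - first_load.min())
print("load spread at step 199:", last_load.max() - last_load.min())
```

In this sketch the bias plays the role of a dual variable on each expert's load. It is updated once per batch rather than solved to optimality, and this is the structure the primal-dual view formalizes.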
