$\phi$-Balancing for Mixture-of-Experts Training

Lizhang Chen; Jonathan Li; Qi Wang; Runlong Liao; Shuozhe Li; Chen Liang; Ni Lao; Qiang Liu

arXiv:2605.15403·cs.LG·May 18, 2026


TL;DR

The paper introduces $\phi$-balancing, a new principled framework for improving expert load balancing in mixture-of-experts models, leading to more stable and effective utilization.

Contribution

It proposes a population-level balancing objective built on a strictly convex potential, together with an efficient online algorithm that outperforms prior Switch-style and loss-free baselines in large-scale MoE training.

Findings

01

$\phi$-balancing outperforms prior Switch-style and loss-free baselines in large-scale pretraining and downstream fine-tuning.

02

The method achieves more stable expert utilization during training.

03

It introduces an efficient EMA-based routing adjustment with negligible overhead.
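To make the idea of an EMA-based routing adjustment concrete, here is a toy sketch in Python. It is not the paper's algorithm: the quadratic potential, the top-1 routing rule, the synthetic Gaussian logits, and the names `grad_phi`, `expert_load`, `simulate`, `beta`, and `strength` are all assumptions chosen for illustration. The sketch keeps an exponential moving average of per-expert load and subtracts a bias proportional to the centered gradient of the potential from the router logits.

```python
import numpy as np


def grad_phi(p):
    """Gradient of an illustrative potential phi(p) = 0.5 * ||p||^2 (assumed, not from the paper)."""
    return p


def expert_load(assignments, num_experts):
    """Fraction of tokens in the batch assigned to each expert."""
    counts = np.bincount(assignments, minlength=num_experts)
    return counts / counts.sum()


def simulate(steps=200, tokens=2048, num_experts=4, beta=0.9, strength=4.0, seed=0):
    """Run a toy router whose logits are skewed toward expert 0.

    Returns the per-expert load of the final batch.
    """
    rng = np.random.default_rng(seed)
    skew = np.zeros(num_experts)
    skew[0] = 1.0  # the toy router systematically prefers expert 0

    # EMA estimate of the expected routing distribution, started at uniform.
    p_ema = np.full(num_experts, 1.0 / num_experts)
    load = p_ema.copy()

    for _ in range(steps):
        logits = rng.normal(size=(tokens, num_experts)) + skew

        # Dual-style variable: gradient of the potential at the EMA estimate,
        # centered so that a perfectly balanced load produces zero bias.
        lam = grad_phi(p_ema)
        bias = strength * (lam - lam.mean())

        # Top-1 routing on the adjusted logits: overloaded experts are penalized.
        assignments = np.argmax(logits - bias, axis=1)
        load = expert_load(assignments, num_experts)

        # EMA update smooths the noisy mini-batch statistics.
        p_ema = beta * p_ema + (1.0 - beta) * load

    return load


baseline = simulate(strength=0.0)  # no adjustment
adjusted = simulate(strength=4.0)  # EMA-based adjustment
print("max load without adjustment:", baseline.max())
print("max load with adjustment:   ", adjusted.max())
```

In this toy setting the adjusted run should show a lower maximum per-expert load than the unadjusted one. The only extra state is a single vector with one entry per expert, which is consistent with the low overhead the finding describes.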

Abstract

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert…
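The two mathematical steps named in the abstract can be sketched with standard convex analysis. The notation here is an assumption, not taken from the paper: $p(\theta)$ denotes the expected routing distribution over experts under router parameters $\theta$, $\hat{p}$ an unbiased mini-batch estimate of it, and $\phi^*$ the convex conjugate of $\phi$.

First, because $\phi$ is strictly convex, Jensen's inequality gives one way to see the bias of mini-batch objectives:

$$\mathbb{E}\big[\phi(\hat{p})\big] \;\ge\; \phi\big(\mathbb{E}[\hat{p}]\big) = \phi\big(p(\theta)\big),$$

with strict inequality whenever $\hat{p}$ is not almost surely constant. Minimizing the batch-level potential therefore targets a different quantity than the population-level one.

Second, for a closed convex $\phi$, the Fenchel-Moreau identity turns the population objective into a min-max problem:

$$\min_{\theta}\, \phi\big(p(\theta)\big) \;=\; \min_{\theta}\, \max_{\lambda}\, \Big\{ \langle \lambda,\, p(\theta) \rangle - \phi^*(\lambda) \Big\}, \qquad \phi^*(\lambda) = \sup_{p}\, \big\{ \langle \lambda, p \rangle - \phi(p) \big\}.$$

When $\phi$ is differentiable, the inner maximum is attained at $\lambda = \nabla \phi\big(p(\theta)\big)$. The inner objective is linear in $p(\theta)$, so unbiased mini-batch estimates of $p(\theta)$ give unbiased estimates of it for a fixed $\lambda$. This is presumably what makes an online update of the dual variable natural, though the exact update rule is the paper's and is not reproduced here.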
