Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
Charafeddine Mouzouni

TL;DR
This paper models Mixture-of-Experts token routing as a congestion game and finds a three-phase load-balancing trajectory during training: the router first learns to balance load, then holds balance steady while experts specialize, and finally trades balance for quality.
Contribution
It introduces a theoretical framework for MoE routing dynamics, capturing load balance evolution and providing diagnostics for load prediction and robustness verification.
Findings
Load balance peaks during the early-training surge phase.
Experts specialize under steady load balance in the stabilization phase.
The router shifts from balancing load to prioritizing quality in the relaxation phase.
Abstract
We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late…
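To make the congestion-game framing concrete, here is a minimal toy sketch in Python. It is not the paper's estimator: the utility form (router score minus gamma times the expert's current load fraction), the greedy sequential assignment, the max-over-mean imbalance measure, and all function names are assumptions chosen for illustration. It only shows the qualitative tradeoff the abstract describes, where a larger congestion coefficient pushes routing toward balance and a smaller one lets quality scores dominate.

```python
import random


def make_scores(n_tokens, n_experts, seed=0):
    """Hypothetical router scores: uniform noise, plus a fixed bonus for
    expert 0 so that pure quality-based routing is visibly skewed."""
    rng = random.Random(seed)
    return [
        [rng.random() + (0.5 if e == 0 else 0.0) for e in range(n_experts)]
        for _ in range(n_tokens)
    ]


def route_tokens(scores, gamma):
    """Greedy sequential routing (an assumed toy rule): each token picks the
    expert maximizing score - gamma * (that expert's current load fraction)."""
    n_tokens = len(scores)
    n_experts = len(scores[0])
    loads = [0] * n_experts
    for token_scores in scores:
        best = max(
            range(n_experts),
            key=lambda e: token_scores[e] - gamma * loads[e] / n_tokens,
        )
        loads[best] += 1
    return loads


def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    scores = make_scores(2000, 8)
    for gamma in (0.0, 5.0, 50.0):
        loads = route_tokens(scores, gamma)
        print(f"gamma={gamma:5.1f} loads={loads} imbalance={imbalance(loads):.2f}")
```

In this toy setting, gamma = 0 sends most tokens to the favored expert, while a large gamma spreads tokens almost evenly across experts regardless of score. An effective coefficient like gamma_eff can be read as locating a trained router somewhere between those two extremes.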
