E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
Qingjun Zhang

TL;DR
This paper introduces a dimensionless parameter E = T*H/(O+B) that predicts whether a Mixture-of-Experts model keeps all of its experts active or collapses into dead experts, giving a training diagnostic and reducing the need for auxiliary load-balancing losses.
Contribution
The authors propose E as a universal control parameter for MoE models, validated on vision (CIFAR-10, CIFAR-100, TinyImageNet-200) and language (WikiText-2, WikiText-103) datasets, and use it to study expert ecology and training dynamics.
Findings
E >= 0.5 guarantees zero dead experts
Dead experts can be revived when the balance loss drives router re-exploration
Ecological health is temperature-invariant across a wide range
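The findings above are stated in terms of "dead experts". As a minimal sketch of how that count might be measured, the snippet below treats an expert as dead if no token in a batch is routed to it. That definition, and the names `count_dead_experts`, `assignments`, and `num_experts`, are assumptions for illustration; the paper may use a different criterion (for example, a usage threshold over an epoch).

```python
from collections import Counter


def count_dead_experts(assignments, num_experts):
    """Count experts that receive no tokens.

    assignments: iterable of expert indices, one per routed token
                 (hypothetical top-1 routing output).
    num_experts: total number of experts in the MoE layer.

    Assumption: "dead" means zero routed tokens in this batch.
    """
    counts = Counter(assignments)
    # Counter returns 0 for experts that never appear in assignments.
    return sum(1 for e in range(num_experts) if counts[e] == 0)


# Usage: four experts, expert 2 never receives a token.
dead = count_dead_experts([0, 0, 1, 3], num_experts=4)
print(dead)
```

Tracking this count over training would also show the revival effect described above, as a dead-expert count that drops back toward zero.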
Abstract
We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the…
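The formula in the abstract can be evaluated directly from the four hyperparameters. The sketch below computes E and checks it against the reported 0.5 threshold. The function names and the example values are hypothetical, and the guard against a non-positive denominator is an added assumption rather than something the abstract specifies.

```python
def ecology_parameter(T, H, O, B):
    """Compute E = T*H/(O+B).

    T: routing temperature
    H: routing entropy weight
    O: oracle weight
    B: balance weight
    """
    denom = O + B
    if denom <= 0:
        # Assumption: E is only meaningful when O + B is positive.
        raise ValueError("O + B must be positive")
    return T * H / denom


def meets_threshold(E, threshold=0.5):
    """Check E against the threshold reported in the abstract (E >= 0.5)."""
    return E >= threshold


# Usage with illustrative (not published) hyperparameter values.
E = ecology_parameter(T=1.0, H=0.5, O=0.5, B=0.5)
print(E, meets_threshold(E))
```

A check like this could run before training starts to flag configurations that fall below the threshold.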
