Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts
Ye Su, Yong Liu

TL;DR
This paper provides a unified Bayesian and information-theoretic framework for understanding mixture-of-experts (MoE) models. It shows why orthogonality between expert representations matters and that routing is an NP-hard sparse subset selection problem.
Contribution
The paper presents what its authors describe as the first unified theoretical framework for MoE. It derives the heuristic practices of Top-k routing and auxiliary load balancing as optimal sparse posterior approximation and prior regularization, and it highlights the role of orthogonality in improving routing.
Findings
Orthogonality regularization improves expert routing efficiency.
Routing in MoE is NP-hard, with a coherence barrier affecting optimality.
Orthogonality narrows the gap between greedy and optimal expert selection.
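To make the findings above concrete, here is a minimal Python sketch of the two quantities they refer to. It assumes the standard sparse-approximation definition of mutual coherence (the largest absolute cosine similarity between two distinct expert vectors) and a simple squared off-diagonal Gram penalty as the orthogonality regularizer. The paper's exact definitions are not given in this summary, so the function names and the penalty form are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mutual_coherence(experts):
    """Largest |cosine similarity| between two distinct expert vectors.

    Assumption: this is the textbook sparse-approximation definition,
    used here as a stand-in for the paper's coherence measure.
    """
    unit = [normalize(v) for v in experts]
    worst = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            worst = max(worst, abs(dot(unit[i], unit[j])))
    return worst


def orthogonality_penalty(experts):
    """Sum of squared off-diagonal entries of the normalized Gram matrix.

    Hypothetical regularizer: it is zero exactly when the expert
    vectors are mutually orthogonal.
    """
    unit = [normalize(v) for v in experts]
    total = 0.0
    for i in range(len(unit)):
        for j in range(len(unit)):
            if i != j:
                total += dot(unit[i], unit[j]) ** 2
    return total


# Toy expert representations (made-up values for illustration only).
orthogonal_experts = [[1.0, 0.0], [0.0, 1.0]]
coherent_experts = [[1.0, 0.0], [1.0, 0.1]]

print(mutual_coherence(orthogonal_experts))
print(mutual_coherence(coherent_experts))
```

In this toy setup the nearly parallel pair has coherence close to 1, which is the regime in which the findings say greedy selection can drift away from the optimal subset; driving the penalty toward zero pushes the experts toward the orthogonal case.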
Abstract
Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. Their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, however, lacking a cohesive theoretical underpinning to support them. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a "Coherence Barrier"; when expert representations exhibit high mutual coherence, greedy routing…
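The two heuristics named in the abstract, Top-k routing and auxiliary load balancing, can be sketched as follows. This is a minimal illustration under stated assumptions: the router is a plain softmax over per-expert logits, and the balancing term follows the widely used Switch Transformer style form (number of experts times the sum over experts of dispatch fraction times mean router probability). It is not the derivation or the code from the paper, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Keep the k most probable experts and renormalize their weights.

    Returns (weights, probs): weights maps expert index -> gate weight,
    probs is the full softmax distribution before truncation.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}, probs


def load_balance_loss(batch_logits, k):
    """Auxiliary balancing term: n_experts * sum_i f_i * P_i.

    f_i is the fraction of routing slots assigned to expert i and
    P_i is the mean router probability for expert i over the batch.
    Assumption: Switch Transformer style form, generalized to top-k.
    """
    n_experts = len(batch_logits[0])
    n_tokens = len(batch_logits)
    counts = [0.0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        weights, probs = top_k_route(logits, k)
        for i in weights:
            counts[i] += 1.0
        for i, p in enumerate(probs):
            prob_sums[i] += p
    fractions = [c / (n_tokens * k) for c in counts]
    mean_probs = [s / n_tokens for s in prob_sums]
    return n_experts * sum(f * p for f, p in zip(fractions, mean_probs))


# Made-up router logits: two tokens, two experts, top-1 routing.
balanced_batch = [[2.0, 0.0], [0.0, 2.0]]
skewed_batch = [[2.0, 0.0], [2.0, 0.0]]

print(load_balance_loss(balanced_batch, k=1))
print(load_balance_loss(skewed_batch, k=1))
```

With these toy logits the balanced batch attains the minimum value of 1, while the batch that sends every token to the same expert scores higher, which is the behavior the auxiliary term is meant to penalize.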
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks · Ethics and Social Impacts of AI
