Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts

Ye Su; Yong Liu

arXiv:2601.03577·cs.LG·January 8, 2026

Open Access

TL;DR

This paper provides a unified Bayesian and information-theoretic framework for mixture-of-experts models, frames routing as an NP-hard sparse subset selection problem, and highlights the role of orthogonality among expert representations.

Contribution

The paper builds what the authors describe as the first unified theoretical framework for MoE. It derives the heuristic practices of Top-k routing and auxiliary load balancing as optimal sparse posterior approximation and prior regularization, and highlights the role of orthogonality in improving routing.

Findings

01. Orthogonality regularization improves expert routing efficiency.

02. Routing in MoE is NP-hard, with a coherence barrier affecting optimality.

03. Orthogonality narrows the gap between greedy and optimal expert selection.

Abstract

Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. Their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, however, lacking a cohesive theoretical underpinning to support them. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a "Coherence Barrier"; when expert representations exhibit high mutual coherence, greedy routing…
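The two quantities the abstract leans on, Top-k routing and the mutual coherence of expert representations, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`top_k_route`, `mutual_coherence`), the softmax gate, and the renormalization of the selected gates are assumptions made for illustration. Mutual coherence is computed with the standard sparse-approximation definition (largest absolute cosine similarity between two distinct vectors), which the paper may or may not use in exactly this form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Hypothetical Top-k gate: keep the k most probable experts and
    renormalize their probabilities so they sum to 1."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def mutual_coherence(vectors):
    """Largest absolute cosine similarity between two distinct vectors.
    0.0 means pairwise orthogonal; values near 1.0 mean near-duplicates."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    worst = 0.0
    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            dot = sum(x * y for x, y in zip(vectors[a], vectors[b]))
            worst = max(worst, abs(dot) / (norm(vectors[a]) * norm(vectors[b])))
    return worst


# Route one token over four experts, keeping the two best.
gates = top_k_route([2.0, 1.0, 0.5, -1.0], k=2)

# Compare an orthogonal pair of expert vectors with a nearly parallel pair.
orthogonal = mutual_coherence([[1.0, 0.0], [0.0, 1.0]])
correlated = mutual_coherence([[1.0, 0.0], [1.0, 0.1]])
```

In this toy setup, `gates` holds weights for experts 0 and 1 only, and the nearly parallel pair has a coherence close to 1 while the orthogonal pair has a coherence of 0. Orthogonality regularization, as the findings describe it, aims to push expert representations toward the low-coherence end of that range.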

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks · Ethics and Social Impacts of AI