Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs
Ali Khalesi, Mohammad Reza Deylam Salehi

TL;DR
This paper models Mixture-of-Experts architectures as communication channels with finite information rates, deriving bounds that reveal trade-offs between gating capacity, model expressivity, and generalization performance.
Contribution
It introduces an information-theoretic framework for analyzing MoE gating, providing capacity-aware limits and a rate-distortion characterization of finite-rate gating.
Findings
Empirical validation of the trade-offs between gating rate and generalization.
Derivation of a mutual-information based generalization bound.
Numerical simulations confirming theoretical predictions.
Abstract
Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization of finite-rate gating, which yields a corresponding bound under a standard empirical rate-distortion optimality condition. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
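To make the "gate as a finite-rate channel" picture concrete, the following minimal sketch computes an information rate for a toy stochastic gate. It assumes the rate is measured as the mutual information between a discrete input X and the selected expert index T, which this summary does not spell out. The function names, the softmax gate, and the temperature parameter are illustrative choices, not the authors' setup.

```python
import numpy as np

def gating_rate_bits(p_x, gate):
    """Mutual information I(X;T) in bits.

    p_x  : shape (n,), probabilities of n discrete inputs (assumed setup).
    gate : shape (n, K), row i is the gate's distribution over K experts
           given input i (rows sum to 1).
    """
    p_x = np.asarray(p_x, dtype=float)
    gate = np.asarray(gate, dtype=float)
    joint = p_x[:, None] * gate            # p(x, t)
    p_t = joint.sum(axis=0)                # marginal over experts
    indep = p_x[:, None] * p_t[None, :]    # p(x) p(t)
    mask = joint > 0                       # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

def softmax_gate(logits, temperature):
    """Row-wise softmax; a higher temperature gives a noisier gate."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Four equiprobable inputs, four experts, logits favouring one expert each.
p_x = np.full(4, 0.25)
logits = 4.0 * np.eye(4)

sharp = gating_rate_bits(p_x, softmax_gate(logits, temperature=0.1))
soft = gating_rate_bits(p_x, softmax_gate(logits, temperature=10.0))
uniform = gating_rate_bits(p_x, np.full((4, 4), 0.25))

print(f"near-deterministic gate: {sharp:.3f} bits")
print(f"high-temperature gate:   {soft:.3f} bits")
print(f"input-independent gate:  {uniform:.3f} bits")
```

In this toy setting the rate can never exceed log2(K) = 2 bits for K = 4 experts. A near-deterministic gate approaches that ceiling, while a noisy or input-independent gate conveys almost no information about which expert suits the input.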
Taxonomy
Topics: Age of Information Optimization · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
