Mixture-of-Experts under Finite-Rate Gating: Communication--Generalization Trade-offs

Ali Khalesi; Mohammad Reza Deylam Salehi

arXiv:2602.15091 · stat.ML · March 27, 2026

Open Access

TL;DR

This paper models Mixture-of-Experts architectures as communication channels with finite information rates, deriving bounds that reveal trade-offs between gating capacity, model expressivity, and generalization performance.

Contribution

It introduces an information-theoretic framework for analyzing MoE gating, providing capacity-aware limits and a rate-distortion characterization of finite-rate gating.

Findings

01. Empirical validation of the trade-offs between gating rate and generalization.

02. Derivation of a mutual-information based generalization bound.

03. Numerical simulations confirming theoretical predictions.

Abstract

Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_{g})$ of finite-rate gating, where $R_{g} := I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $E[R(W)] \leq D(R_{g}) + \delta_{m} + (2/m) I(S; W)$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
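
To make the gating rate $R_{g} := I(X; T)$ concrete, the following is a minimal sketch that computes the mutual information of a discrete gate, assuming a finite input alphabet and a gate given as a row-stochastic matrix $p(t \mid x)$. The function name `gating_rate_bits` and the three toy gates are hypothetical illustrations, not taken from the paper, and the sketch does not reproduce the paper's bound or simulations.

```python
import math


def gating_rate_bits(p_x, p_t_given_x):
    """Mutual information I(X;T) in bits for a discrete gate.

    p_x: probabilities of the n input symbols (assumed to sum to 1).
    p_t_given_x: n rows, each a distribution over the K experts.
    Hypothetical helper for illustration only.
    """
    n_experts = len(p_t_given_x[0])
    # Marginal expert usage: p(t) = sum_x p(x) p(t|x).
    p_t = [
        sum(px * row[t] for px, row in zip(p_x, p_t_given_x))
        for t in range(n_experts)
    ]
    rate = 0.0
    for px, row in zip(p_x, p_t_given_x):
        for t, ptx in enumerate(row):
            # Zero-probability terms contribute nothing to the sum.
            if px > 0 and ptx > 0:
                rate += px * ptx * math.log2(ptx / p_t[t])
    return rate


# Toy setting (assumed): 4 equiprobable inputs routed to K = 2 experts.
p_x = [0.25, 0.25, 0.25, 0.25]
hard_gate = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
soft_gate = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
blind_gate = [[0.5, 0.5]] * 4

rate_hard = gating_rate_bits(p_x, hard_gate)
rate_soft = gating_rate_bits(p_x, soft_gate)
rate_blind = gating_rate_bits(p_x, blind_gate)
print(rate_hard, rate_soft, rate_blind)
```

In this toy setting the deterministic gate attains the maximum rate of $\log_2 K = 1$ bit, the input-independent gate has rate zero, and the noisy gate falls in between, which is the quantity the rate-distortion function $D(R_{g})$ is indexed by.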

Taxonomy

Topics: Age of Information Optimization · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research