Generalization and Scaling Laws for Mixture-of-Experts Transformers
Mansour Zoubeirou a Mayaki

TL;DR
This paper develops a theoretical framework for understanding the generalization, approximation, and scaling behaviors of Mixture-of-Experts Transformers, highlighting the roles of active parameters and routing in model performance.
Contribution
It introduces a new theory separating active capacity from routing, derives bounds on generalization, and establishes neural scaling laws for MoE architectures.
Findings
Generalization bounds depend on active parameters and routing overhead.
Approximation error can be reduced by scaling active capacity or by increasing the number of experts.
Neural scaling laws are derived for the tradeoffs among model size, data, and compute.
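To make the distinction between active capacity and routing overhead concrete, here is a minimal Python sketch. It is not the paper's construction: the function names, the two-matrix feed-forward expert, the linear router, and the choice to ignore attention parameters and biases are all illustrative assumptions. It counts total versus per-input active parameters for a top-k MoE stack and computes the log of the number of expert subsets a single input can be routed to, which is the kind of combinatorial term a union bound over routing patterns would pay.

```python
from math import comb, log


def moe_param_counts(d_model, d_ff, num_experts, top_k, num_layers):
    """Return (total, active) parameter counts for a stack of MoE layers.

    Assumptions (illustrative, not from the paper): each expert is a
    two-matrix feed-forward block, the router is a single linear map,
    and attention parameters and biases are ignored.
    """
    expert_params = 2 * d_model * d_ff      # up- and down-projection
    router_params = d_model * num_experts   # linear gating scores
    total = num_layers * (num_experts * expert_params + router_params)
    active = num_layers * (top_k * expert_params + router_params)
    return total, active


def log_routing_patterns(num_experts, top_k, num_layers):
    """Log of the number of expert subsets one input can select.

    Each layer picks top_k of num_experts experts, so a single input
    has comb(num_experts, top_k) ** num_layers possible patterns.
    """
    return num_layers * log(comb(num_experts, top_k))


if __name__ == "__main__":
    total, active = moe_param_counts(
        d_model=512, d_ff=2048, num_experts=8, top_k=2, num_layers=12
    )
    overhead = log_routing_patterns(num_experts=8, top_k=2, num_layers=12)
    print(f"total={total}, active={active}, log patterns={overhead:.2f}")
```

In this sketch, adding experts while holding `top_k` fixed grows the total count and the routing term but leaves the active count almost unchanged (only the router grows), which is the separation the findings above describe.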
Abstract
We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates active per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs an MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a -dimensional manifold data model and targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of…
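The "condition on routing patterns, then union-bound" step described in the abstract can be illustrated with a generic covering-number fact. The notation below is introduced here for illustration and is an assumption, not the paper's: $\mathcal{R}$ is a finite set of routing patterns, $\mathcal{F}_r$ is the function class obtained with pattern $r$ held fixed, and $N(\varepsilon, \cdot)$ is the sup-norm covering number. This is a sketch of the general mechanism, not the paper's exact bound.

```latex
% Generic union bound over a finite family of classes (illustrative notation).
\[
  \mathcal{F} \subseteq \bigcup_{r \in \mathcal{R}} \mathcal{F}_r
  \quad\Longrightarrow\quad
  N(\varepsilon, \mathcal{F})
  \;\le\; \sum_{r \in \mathcal{R}} N(\varepsilon, \mathcal{F}_r)
  \;\le\; |\mathcal{R}| \, \max_{r \in \mathcal{R}} N(\varepsilon, \mathcal{F}_r),
\]
\[
  \log N(\varepsilon, \mathcal{F})
  \;\le\; \underbrace{\log |\mathcal{R}|}_{\text{routing overhead}}
  \;+\; \underbrace{\max_{r \in \mathcal{R}} \log N(\varepsilon, \mathcal{F}_r)}_{\text{active-parameter entropy}}.
\]
```

Under this reading, each fixed-pattern class involves only the experts that pattern activates, so its entropy depends on the active parameter budget, while the number of patterns contributes a separate additive term.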
