Generalization and Scaling Laws for Mixture-of-Experts Transformers

Mansour Zoubeirou a Mayaki

arXiv:2604.09175·cs.LG·April 13, 2026

TL;DR

This paper develops a theoretical framework for understanding the generalization, approximation, and scaling behaviors of Mixture-of-Experts Transformers, highlighting the roles of active parameters and routing in model performance.

Contribution

It introduces a new theory separating active capacity from routing, derives bounds on generalization, and establishes neural scaling laws for MoE architectures.

Findings

01

Generalization bounds depend on active parameters and routing overhead.

02

Approximation error can be reduced either by scaling active capacity or by increasing the number of experts.

03

The paper derives neural scaling laws covering model size, data size, and compute tradeoffs.
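To make the distinction between active and total capacity concrete, here is a minimal Python sketch. It is not taken from the paper: the layer layout (two-matrix feed-forward experts without biases, a linear router), the function names, and the example sizes are all illustrative assumptions. It counts the weights one input actually touches under top-k routing, and the log of the number of expert subsets a single routing decision can select, which is the kind of combinatorial term a routing overhead would grow with.

```python
from math import comb, log


def moe_ffn_param_counts(d_model, d_ff, num_experts, top_k):
    """Return (total, active) weight counts for one hypothetical MoE layer.

    Assumed layout (not from the paper): each expert is a two-matrix
    feed-forward block without biases, plus a linear router of shape
    d_model x num_experts that every input passes through.
    """
    per_expert = 2 * d_model * d_ff      # up-projection + down-projection
    router = d_model * num_experts       # router weights, always used
    total = num_experts * per_expert + router
    active = top_k * per_expert + router  # only top_k experts run per input
    return total, active


def log_routing_patterns(num_experts, top_k):
    """Natural log of the number of top-k expert subsets for one routing decision."""
    return log(comb(num_experts, top_k))


if __name__ == "__main__":
    total, active = moe_ffn_param_counts(d_model=512, d_ff=2048, num_experts=8, top_k=2)
    print("total weights: ", total)
    print("active weights:", active)
    print("log #patterns: ", log_routing_patterns(8, 2))
```

Under these assumptions, adding experts raises the total count and the pattern count while leaving the active count unchanged, which is the separation the findings refer to.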

Abstract

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates active per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^{\beta}$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of…
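The "condition on a routing pattern, then union-bound" step can be illustrated with the generic covering-number inequality for a finite union of function classes. This is a sketch of the standard argument, not the paper's exact statement; the symbols $\mathcal{R}$ (the finite set of routing patterns), $\mathcal{F}_r$ (the class obtained with pattern $r$ held fixed), $E$ (number of experts), and $k$ (experts selected per decision) are notation assumed here.

```latex
\mathcal{F}_{\mathrm{MoE}} \subseteq \bigcup_{r \in \mathcal{R}} \mathcal{F}_r
\quad\Longrightarrow\quad
\log \mathcal{N}\bigl(\varepsilon, \mathcal{F}_{\mathrm{MoE}}, \|\cdot\|_{\infty}\bigr)
\;\le\;
\log |\mathcal{R}|
\;+\;
\max_{r \in \mathcal{R}} \log \mathcal{N}\bigl(\varepsilon, \mathcal{F}_r, \|\cdot\|_{\infty}\bigr)

% For a single top-k choice among E experts:
|\mathcal{R}| = \binom{E}{k} \le \Bigl(\frac{eE}{k}\Bigr)^{k}
\quad\Longrightarrow\quad
\log |\mathcal{R}| \le k \log\frac{eE}{k}
```

The second term depends only on the parameters active under a fixed pattern, while the first is an additive overhead that grows with the number of admissible patterns, matching the abstract's description of an active-parameter entropy term plus a routing overhead.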
