Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

Yilong Chen; Naibin Gu; Junyuan Shang; Zhenyu Zhang; Yuchen Feng; Jiawei Sheng; Tingwen Liu; Shuohuan Wang; Yu Sun; Hua Wu; Haifeng Wang

arXiv:2603.04971 · cs.LG · March 6, 2026

Open Access

TL;DR

This paper introduces Mixture of Universal Experts (MOUE), a novel MoE architecture that scales model capacity by converting depth into virtual width, enabling more efficient and scalable models beyond traditional width and depth limits.

Contribution

The paper proposes MOUE, a new MoE generalization that introduces Virtual Width, along with innovative routing and load balancing methods to improve scalability and performance.

Findings

01. MOUE outperforms baseline MoE models by up to 1.3% in various regimes.

02. Enables progressive conversion of existing MoE checkpoints with up to 4.2% gains.

03. Reveals Virtual Width as a new scaling dimension for MoE architectures.

Abstract

Mixture-of-Experts (MoE) models decouple model capacity from per-token computation, yet their scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MOUE), an MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MOUE aims to reuse a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for…
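
To make the idea of a layer-shared expert pool more concrete, the following is a minimal sketch of one way such sharing could be laid out. It is not the paper's Staggered Rotational Topology: the sliding-window rule and the names `layer_expert_window`, `exposure_counts`, `pool_size`, `window`, and `stride` are all assumptions chosen for illustration. Each layer sees a fixed-size window into a single pool, and the window rotates by a fixed stride per layer. Counting how many layers expose each expert shows why reuse can give experts unequal exposure, the kind of mismatch the abstract says conventional load balancing does not account for.

```python
def layer_expert_window(layer, pool_size, window, stride):
    """Return the pool indices visible to `layer`.

    Hypothetical rule (an assumption, not taken from the paper): the
    window starts at layer * stride and wraps around the shared pool.
    """
    start = (layer * stride) % pool_size
    return [(start + i) % pool_size for i in range(window)]


def exposure_counts(num_layers, pool_size, window, stride):
    """Count how many layers can route to each expert in the shared pool."""
    counts = [0] * pool_size
    for layer in range(num_layers):
        for expert in layer_expert_window(layer, pool_size, window, stride):
            counts[expert] += 1
    return counts


if __name__ == "__main__":
    # 8 shared experts, each layer sees 4 of them, window shifts by 2 per layer.
    for layer in range(3):
        print(layer, layer_expert_window(layer, pool_size=8, window=4, stride=2))
    # With 3 layers the exposure is uneven; with 4 layers it evens out.
    print(exposure_counts(num_layers=3, pool_size=8, window=4, stride=2))
    print(exposure_counts(num_layers=4, pool_size=8, window=4, stride=2))
```

In this toy setting every layer still routes over only 4 experts, while the stack as a whole can reach all 8, which is one informal way to read "converting depth into virtual width". Whether the layer count and stride cover the pool evenly determines how balanced the per-expert exposure is.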

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Mobile Crowdsensing and Crowdsourcing