TL;DR
UniPool replaces the per-layer expert sets of Mixture-of-Experts (MoE) models with a single shared expert pool, treating expert capacity as a global budget rather than a per-layer allocation. This cuts expert parameters while improving performance.
Contribution
This work proposes UniPool, a shared expert pool design paired with a pool-level auxiliary loss and NormRouter, enabling MoE models whose expert parameters grow sublinearly with depth.
Findings
UniPool consistently improves validation loss and perplexity across multiple model scales.
Reduced-pool UniPool variants match or outperform layer-wise MoE while using only 41.6% to 66.7% of its expert parameters.
Expert parameters can grow sublinearly with depth while maintaining or improving model performance.
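The parameter argument behind these findings can be illustrated with a small count. The sketch below is not taken from the paper: the layer count, expert count, hidden sizes, and pool size are hypothetical, and each expert is assumed to be a two-matrix feed-forward block without biases. The pool size is chosen so that the ratio lands at two thirds, purely for illustration.

```python
def expert_params(num_experts, d_model, d_ff):
    # Assumption: each expert is an up projection plus a down projection, no biases.
    return num_experts * 2 * d_model * d_ff


def layerwise_total(num_layers, experts_per_layer, d_model, d_ff):
    # Layer-wise MoE: every layer owns its own expert set, so cost is linear in depth.
    return num_layers * expert_params(experts_per_layer, d_model, d_ff)


def pool_total(pool_size, d_model, d_ff):
    # Shared pool: the expert budget is set once and does not depend on depth.
    return expert_params(pool_size, d_model, d_ff)


# Hypothetical sizes, for illustration only.
num_layers, experts_per_layer, d_model, d_ff = 12, 8, 512, 2048
pool_size = 64  # compared with 12 * 8 = 96 experts in the layer-wise model

baseline = layerwise_total(num_layers, experts_per_layer, d_model, d_ff)
shared = pool_total(pool_size, d_model, d_ff)
ratio = shared / baseline

print(f"layer-wise expert params: {baseline:,}")
print(f"shared-pool expert params: {shared:,}")
print(f"ratio: {ratio:.1%}")
```

Under these assumptions, adding layers increases `baseline` but leaves `shared` unchanged unless the pool itself is enlarged, which is what allows expert parameters to grow more slowly than depth.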
Abstract
Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization…
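The core idea in the abstract, one expert pool accessed by independent per-layer routers, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the sizes, the single-matrix experts, the softmax top-k gating, and all names (`pool`, `routers`, `moe_layer`) are hypothetical. The pool-level auxiliary loss and NormRouter are not implemented, since this summary does not give their definitions; the sketch only collects the pool-wide utilization counts that such a balancing term would presumably act on.

```python
import math
import random

random.seed(0)

# Hypothetical toy sizes.
D_MODEL, POOL_SIZE, NUM_LAYERS, TOP_K = 4, 6, 3, 2


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One expert pool shared by every layer (each expert is a single linear map here).
pool = [rand_matrix(D_MODEL, D_MODEL) for _ in range(POOL_SIZE)]
# Each layer keeps its own router, but all routers score the same pool.
routers = [rand_matrix(POOL_SIZE, D_MODEL) for _ in range(NUM_LAYERS)]


def moe_layer(x, router, usage):
    probs = softmax(matvec(router, x))
    chosen = sorted(range(POOL_SIZE), key=lambda e: probs[e], reverse=True)[:TOP_K]
    norm = sum(probs[e] for e in chosen)
    out = list(x)  # residual connection
    for e in chosen:
        usage[e] += 1  # pool-level count, accumulated across all layers
        gate = probs[e] / norm  # renormalize gates over the selected experts
        y = matvec(pool[e], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


def forward(x, usage):
    for router in routers:
        x = moe_layer(x, router, usage)
    return x


tokens = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(8)]
pool_usage = [0] * POOL_SIZE
outputs = [forward(t, pool_usage) for t in tokens]

total_routed = sum(pool_usage)
usage_fraction = [u / total_routed for u in pool_usage]
print("pool usage counts:", pool_usage)
```

Because every layer writes into the same `pool_usage` list, the counts describe how the whole network uses each expert rather than how a single layer does. With 8 tokens, 3 layers, and top-2 routing, the counts always sum to 48 regardless of which experts are chosen.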
