Self-Routing: Parameter-Free Expert Routing from Hidden States
Jama Hussein Mohamud, Drew Wagner, Mirco Ravanelli

TL;DR
This paper introduces Self-Routing, a parameter-free expert routing method for Mixture-of-Experts models that uses hidden states directly, eliminating the need for a learned router while maintaining competitive performance.
Contribution
Proposes Self-Routing, a parameter-free routing mechanism for MoE layers that removes the learned router projection, simplifying the architecture and improving expert utilization while remaining competitive with a learned router.
Findings
Self-Routing performs comparably to learned routers on GPT-2-scale language modeling.
Self-Routing achieves about 17% higher normalized routing entropy, indicating that tokens are spread more evenly across experts.
Self-Routing slightly outperforms learned-router MoE on ImageNet-1K classification.
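As a rough illustration of the load-balancing metric mentioned above, the sketch below computes a normalized routing entropy from a batch of expert assignments. The exact definition used in the paper is not given in this summary; this version assumes the common form, the Shannon entropy of the empirical expert-usage distribution divided by log(number of experts), so that 1.0 means perfectly uniform usage and 0.0 means every token goes to a single expert. The function name and arguments are hypothetical.

```python
import numpy as np

def normalized_routing_entropy(expert_ids, num_experts):
    """Entropy of expert usage, scaled to [0, 1].

    Assumed definition (not confirmed by the source):
        H_norm = -sum_i p_i * log(p_i) / log(num_experts)
    where p_i is the fraction of routing assignments sent to expert i.
    """
    ids = np.asarray(expert_ids).ravel()
    counts = np.bincount(ids, minlength=num_experts)
    p = counts / counts.sum()
    nonzero = p[p > 0]  # treat 0 * log(0) as 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    return float(entropy / np.log(num_experts))

# Perfectly balanced routing over 4 experts versus fully collapsed routing.
balanced = normalized_routing_entropy([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
collapsed = normalized_routing_entropy([2, 2, 2, 2], num_experts=4)
```

Under this definition, a higher value means assignments are closer to uniform across experts.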
Abstract
Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced…
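The routing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the "designated subspace" is simply the first `num_experts` coordinates of each token's hidden state, that experts are chosen by top-k over those logits, and that gate weights come from a softmax over the selected logits. The function name, the `top_k` parameter, and these choices are assumptions made for the example.

```python
import numpy as np

def self_route(hidden, num_experts, top_k=2):
    """Route tokens using a slice of the hidden state as expert logits.

    hidden: array of shape (num_tokens, d_model), with d_model >= num_experts.
    Returns (expert indices, gate weights), each of shape (num_tokens, top_k).
    """
    # Assumption: the designated subspace is the first num_experts dimensions.
    # A learned router would instead compute hidden @ W with a trained matrix W.
    logits = hidden[:, :num_experts]

    # Pick the top_k experts per token (largest logits first).
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Assumption: gates are a softmax over the selected logits only.
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    gates = exp / exp.sum(axis=-1, keepdims=True)
    return top_idx, gates

# One token with d_model = 6 and 4 experts; the last two dimensions are ignored.
hidden = np.array([[0.1, 2.0, -1.0, 0.5, 9.0, 9.0]])
top_idx, gates = self_route(hidden, num_experts=4, top_k=2)
```

Because the logits are read directly from the hidden state, no routing parameters are introduced; the rest of the MoE layer (expert networks and the weighted combination of their outputs) would be unchanged.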
