Self-Routing: Parameter-Free Expert Routing from Hidden States

Jama Hussein Mohamud; Drew Wagner; Mirco Ravanelli

arXiv:2604.00421·cs.AI·April 2, 2026

TL;DR

This paper introduces Self-Routing, a parameter-free expert routing method for Mixture-of-Experts models that uses hidden states directly, eliminating the need for a learned router while maintaining competitive performance.

Contribution

Proposes Self-Routing, a parameter-free routing mechanism for MoE layers that removes the learned router projection, simplifies the architecture, and yields more balanced expert utilization while remaining competitive with a learned-router baseline.

Findings

01

Self-Routing performs comparably to learned routers on GPT-2-scale language modeling.

02

Self-Routing achieves about 17% higher normalized routing entropy, indicating more balanced expert utilization.

03

Self-Routing slightly outperforms learned-router MoE on ImageNet-1K classification.
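
For readers unfamiliar with the metric in finding 02, the following is a minimal sketch of one common way to compute a normalized routing entropy. It assumes the metric is the Shannon entropy of the empirical expert-assignment distribution divided by log(E) for E experts; the paper's exact definition is not given on this page, and the function name `normalized_routing_entropy` is hypothetical.

```python
import math
from collections import Counter


def normalized_routing_entropy(assignments, num_experts):
    """Shannon entropy of expert usage, divided by log(num_experts).

    ASSUMPTION: this is a generic definition, not necessarily the one
    used in the paper. The result is 1.0 when tokens are spread evenly
    over all experts and 0.0 when every token goes to one expert.
    """
    if num_experts < 2:
        raise ValueError("need at least two experts")
    if not assignments:
        raise ValueError("need at least one routed token")
    counts = Counter(assignments)
    total = len(assignments)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy / math.log(num_experts)


# Toy data: 16 tokens routed over 4 experts.
balanced = normalized_routing_entropy([0, 1, 2, 3] * 4, num_experts=4)
collapsed = normalized_routing_entropy([0] * 16, num_experts=4)
skewed = normalized_routing_entropy([0] * 13 + [1, 2, 3], num_experts=4)
```

Under this definition, a higher value means tokens are distributed more evenly across experts, which is why the finding reads it as a sign of better load balance.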

Abstract

Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced…
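
The routing idea described in the abstract can be illustrated with a small sketch. The abstract only states that a designated subspace of the hidden state serves directly as expert logits. Everything else here is an assumption made for illustration: that the subspace is the first `num_experts` coordinates, that selection is top-k, and that the selected logits are softmax-normalized into mixing weights. The function name `self_route` is hypothetical.

```python
import math


def self_route(hidden, num_experts, top_k=2):
    """Choose experts for one token using its own hidden state as logits.

    ASSUMPTIONS (not confirmed by the source): the designated subspace
    is the first `num_experts` coordinates, selection is top-k, and
    mixing weights are a softmax over the selected logits.
    """
    if not 0 < top_k <= num_experts <= len(hidden):
        raise ValueError("require 0 < top_k <= num_experts <= len(hidden)")
    # No router projection: a slice of the hidden state is the logits.
    logits = hidden[:num_experts]
    # Top-k experts by logit, ties broken in favor of the lower index.
    order = sorted(range(num_experts), key=lambda e: (-logits[e], e))
    chosen = order[:top_k]
    # Numerically stable softmax over the selected logits only.
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    norm = sum(exps)
    weights = [x / norm for x in exps]
    return chosen, weights


# Toy token with a 6-dimensional hidden state and 4 experts.
token = [0.1, 2.0, -1.0, 2.0, 0.5, 0.3]
experts, weights = self_route(token, num_experts=4, top_k=2)
```

For comparison, a standard linear router would first multiply the hidden state by a learned matrix of shape (hidden size × number of experts) to obtain the logits; the sketch skips that step, which is where the parameter savings come from.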
