Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Sagi Ahrac, Noya Hochwald, Mor Geva

TL;DR
This paper identifies a geometric coupling mechanism in sparse mixture-of-experts (SMoE) models. It explains how routing decisions align with expert activations, shows how auxiliary load-balancing losses affect that structure, and proposes a parameter-free online K-Means router.
Contribution
The paper reveals the geometric coupling between routers and their experts, analyzes the impact of load-balancing losses on it, and introduces a parameter-free online K-Means router.
Findings
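For a routed token, the selected expert's router weights and that expert's own weights receive gradients along the same input direction, which couples them geometrically.
An auxiliary load-balancing loss disrupts this coupling and increases the similarity among router directions.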
A parameter-free K-Means router achieves low load imbalance with minimal perplexity increase.
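As a rough illustration of what an online K-Means router could look like, the sketch below assigns each token to its nearest centroid and then nudges that centroid toward the token with a running-mean step. The summary does not describe the authors' update rule, distance measure, or initialization, so everything here is an assumption: the squared Euclidean distance, the MacQueen-style running mean, and the hypothetical names `nearest` and `route_online`. It should be read as generic online K-Means, not as the paper's router.

```python
def nearest(centroids, x):
    """Index of the centroid with the smallest squared Euclidean distance to x."""
    def sq_dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def route_online(tokens, init_centroids):
    """Route tokens one at a time; each centroid stands in for one expert.

    Assumption: centroids are updated by a running mean, with no gradient
    step and no learned router matrix.
    """
    centroids = [list(c) for c in init_centroids]
    counts = [1] * len(centroids)  # each initial centroid counts as one point
    assignments = []
    for x in tokens:
        k = nearest(centroids, x)
        counts[k] += 1
        step = 1.0 / counts[k]
        centroids[k] = [ci + step * (xi - ci) for ci, xi in zip(centroids[k], x)]
        assignments.append(k)
    return assignments, centroids


# Toy data: two well-separated groups of 2-D "tokens".
tokens = [[0.0, 1.0], [4.0, 5.0], [1.0, 0.0], [5.0, 4.0]]
assignments, centroids = route_online(tokens, [[0.0, 0.0], [5.0, 5.0]])
load = [assignments.count(k) for k in range(len(centroids))]
print(assignments, load)
```

In this toy run each centroid ends up as the mean of its initial value and the tokens routed to it, so the per-expert load can be read directly from `load`.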
Abstract
Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router-expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored…
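The claim that the router row and the expert weights receive gradients along the same input direction can be checked numerically on a toy layer. The sketch below is an assumption-laden illustration, not the paper's setup: it uses top-1 routing with a softmax gate, a single ReLU up-projection per expert, a scalar output, and a squared-error loss, and all names (`loss`, `numeric_grad`, `abs_cosine`) and weight values are hypothetical. Gradients are taken by central finite differences, and each one is compared with the token `x` by absolute cosine similarity.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def loss(router_w, expert_w, out_w, x, target):
    """Toy top-1 routed layer with scalar output and squared-error loss."""
    gates = softmax([dot(row, x) for row in router_w])
    e = max(range(len(gates)), key=lambda k: gates[k])       # selected expert
    hidden = [max(0.0, dot(row, x)) for row in expert_w[e]]  # ReLU neurons
    y = gates[e] * dot(out_w[e], hidden)
    return 0.5 * (y - target) ** 2


def numeric_grad(fn, row, eps=1e-6):
    """Central finite-difference gradient of fn() w.r.t. the list `row`."""
    grad = []
    for i in range(len(row)):
        orig = row[i]
        row[i] = orig + eps
        up = fn()
        row[i] = orig - eps
        down = fn()
        row[i] = orig
        grad.append((up - down) / (2 * eps))
    return grad


def abs_cosine(a, b):
    return abs(dot(a, b)) / math.sqrt(dot(a, a) * dot(b, b))


# Hand-picked values: expert 0 wins the routing and both of its neurons are active.
x = [1.0, -2.0, 0.5]
router_w = [[0.2, -0.1, 0.4], [-0.3, 0.2, 0.1]]
expert_w = [[[0.5, -0.2, 0.1], [0.3, 0.1, 0.6]],
            [[0.1, 0.1, 0.1], [0.2, -0.1, 0.3]]]
out_w = [[0.7, -0.4], [0.2, 0.5]]
target = 1.0


def current_loss():
    return loss(router_w, expert_w, out_w, x, target)


router_grad = numeric_grad(current_loss, router_w[0])     # selected router row
expert_grad = numeric_grad(current_loss, expert_w[0][0])  # one expert neuron
cos_router = abs_cosine(router_grad, x)
cos_expert = abs_cosine(expert_grad, x)
print(cos_router, cos_expert)
```

Both cosines should come out numerically indistinguishable from 1, which reflects the statement in the abstract: each gradient is the token `x` multiplied by a different scalar coefficient. A neuron whose ReLU is inactive for this token would receive a zero gradient, so the comparison only makes sense for active neurons.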
