Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
Ivan Ternovtsii, Yurii Bilak

TL;DR
Routing topology in sparse Mixture-of-Experts (MoE) models does not determine language modeling quality: in controlled experiments on WikiText-103, different routing methods reach statistically equivalent perplexity.
Contribution
The paper demonstrates that different routing topologies in MoE models reach statistically equivalent perplexity, which calls into question how much routing design matters for language modeling quality.
Findings
Routing topology does not determine asymptotic perplexity.
Cosine-similarity routing, which requires 80% fewer routing parameters than a standard linear router, performs on par with more complex routing methods.
Multi-hop updates act as magnitude amplifiers rather than performing compositional reasoning.
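To make the cosine-routing finding concrete, here is a minimal sketch of cosine-similarity routing against learned centroids in a low-dimensional space. It is an illustration under stated assumptions, not the authors' implementation: the function name `cosine_route`, the linear projection `proj`, the top-k softmax weighting, and all dimensions (`d_model`, `d_route`, `n_experts`) are hypothetical choices made here, since the summary does not specify them.

```python
import numpy as np

def cosine_route(x, proj, centroids, top_k=2, eps=1e-8):
    """Pick experts by cosine similarity in a low-dimensional routing space.

    x:         (n_tokens, d_model) token representations
    proj:      (d_model, d_route) projection into the routing space (assumed)
    centroids: (n_experts, d_route) learned expert centroids
    Returns (indices, weights), each of shape (n_tokens, top_k).
    """
    # Project tokens into the routing space and L2-normalize both sides,
    # so the dot product below is a cosine similarity in [-1, 1].
    z = x @ proj
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    c = centroids / (np.linalg.norm(centroids, axis=-1, keepdims=True) + eps)
    sims = z @ c.T  # (n_tokens, n_experts)

    # Keep the top_k most similar experts per token.
    idx = np.argsort(-sims, axis=-1)[:, :top_k]
    top = np.take_along_axis(sims, idx, axis=-1)

    # Softmax over the selected similarities (an assumed weighting scheme).
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return idx, w

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_route, n_experts = 512, 16, 64
x = rng.standard_normal((4, d_model))
proj = rng.standard_normal((d_model, d_route))
centroids = rng.standard_normal((n_experts, d_route))

idx, w = cosine_route(x, proj, centroids)
```

Because both the projected tokens and the centroids are normalized, the router's decision depends only on direction in the routing space, not on activation magnitude.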
Abstract
Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms -- learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space (), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76--84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93--34.72). The finding extends to hash,…
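For readers unfamiliar with equivalence testing, the general form of the TOST procedure mentioned above can be written as follows. The symbols are notation introduced here, not the paper's: \mu_A and \mu_B denote the mean perplexities of two routing variants, \delta is the equivalence margin (1 PPL in the abstract), and \alpha is the significance level. The excerpt does not state which test statistic the authors used or the p-values obtained, so only the hypothesis structure is shown.

```latex
\begin{aligned}
H_{0}^{-} &: \mu_A - \mu_B \le -\delta
  \quad\text{vs.}\quad H_{1}^{-} : \mu_A - \mu_B > -\delta, \\
H_{0}^{+} &: \mu_A - \mu_B \ge +\delta
  \quad\text{vs.}\quad H_{1}^{+} : \mu_A - \mu_B < +\delta, \\
p_{\mathrm{TOST}} &= \max\left(p^{-},\, p^{+}\right),
  \qquad \text{equivalence is declared when } p_{\mathrm{TOST}} < \alpha, \\
\binom{5}{2} &= 10 \ \text{pairwise comparisons among five variants}.
\end{aligned}
```

Unlike an ordinary difference test, where a non-significant result is inconclusive, rejecting both one-sided nulls is positive evidence that the true difference lies inside (-\delta, +\delta).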
