Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

Ivan Ternovtsii; Yurii Bilak

arXiv:2604.14419·cs.AI·April 17, 2026

TL;DR

This study shows that routing topology in sparse Mixture-of-Experts models does not significantly impact language modeling quality, with various routing methods yielding statistically equivalent perplexity results.

Contribution

It demonstrates that different routing topologies in MoE models are functionally equivalent in terms of perplexity, challenging the importance of routing design.

Findings

01 Routing topology does not determine asymptotic perplexity.

02 Cosine routing, with 80% fewer routing parameters than standard linear routers, performs similarly to more complex methods.

03 Multi-hop updates act as magnitude amplifiers rather than performing compositional reasoning.
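For readers unfamiliar with equivalence testing, the claim in finding 01 can be read through the standard Two One-Sided Tests (TOST) formulation. The statement below is the textbook form of the procedure, not the paper's own notation: $\mu_A$ and $\mu_B$ (mean perplexity of two routing variants) are symbols introduced here for illustration, while the margin $\Delta = 1$ PPL and the level $\alpha = 0.05$ are the values reported in the abstract.

$$
H_{0}^{(1)}: \mu_A - \mu_B \le -\Delta \quad \text{vs.} \quad H_{1}^{(1)}: \mu_A - \mu_B > -\Delta
$$

$$
H_{0}^{(2)}: \mu_A - \mu_B \ge \Delta \quad \text{vs.} \quad H_{1}^{(2)}: \mu_A - \mu_B < \Delta
$$

Equivalence within $\Delta$ is concluded only when both null hypotheses are rejected at level $\alpha$. With five variants there are $\binom{5}{2} = 10$ such pairs, which matches the 10 pairwise comparisons the abstract reports.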

Abstract

Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms: learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space ($d_{\text{space}} = 64$), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76–84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], $p < 0.05$ for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93–34.72). The finding extends to hash,…
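The cosine-similarity routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the projection matrix `proj`, the `top_k` selection, the function names, and the toy sizes are all assumptions made for illustration. The abstract specifies only that tokens are compared to learned centroids by cosine similarity in a low-dimensional space with $d_{\text{space}} = 64$.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_route(token, proj, centroids, top_k=2):
    """Return the top_k (expert_index, cosine_similarity) pairs for one token.

    token:     hidden state of length d_model
    proj:      d_space x d_model projection matrix (hypothetical; assumed here)
    centroids: n_experts x d_space centroid vectors, one per expert
    """
    # Project the token into the low-dimensional routing space.
    z = l2_normalize([sum(w * x for w, x in zip(row, token)) for row in proj])
    # Cosine similarity is the dot product of unit-length vectors.
    sims = [sum(a * b for a, b in zip(z, l2_normalize(c))) for c in centroids]
    # Keep the experts whose centroids point most nearly in the token's direction.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return [(i, sims[i]) for i in ranked[:top_k]]


# Toy sizes for illustration only; the paper reports d_space = 64.
rng = random.Random(0)
d_model, d_space, n_experts = 16, 4, 8
proj = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_space)]
centroids = [[rng.gauss(0, 1) for _ in range(d_space)] for _ in range(n_experts)]
token = [rng.gauss(0, 1) for _ in range(d_model)]
routes = cosine_route(token, proj, centroids)
```

Because both the projected token and the centroids are normalized, the routing score depends only on direction, not on magnitude. In this sketch the random values stand in for parameters that would be learned during training.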
