Path-Constrained Mixture-of-Experts

Zijin Gu; Tatiana Likhomanenko; Vimal Thilak; Jason Ramapuram; Navdeep Jaitly

arXiv:2603.18297·cs.LG·April 7, 2026

TL;DR

This paper introduces Path-Constrained Mixture-of-Experts (PathMoE), a new architecture that constrains expert paths to improve efficiency, consistency, and performance in large-scale MoE models.

Contribution

The paper proposes constraining expert paths in MoE models and reports improved path clustering, robustness, and performance without auxiliary losses.

Findings

01

PathMoE enhances path concentration and cross-layer consistency.

02

PathMoE models outperform independent routing models on perplexity and downstream tasks.

03

Constrained expert paths improve robustness to routing perturbations.
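
To make "path concentration" concrete, the following is a minimal sketch of how expert-path usage could be tallied from routing decisions. It is not the paper's evaluation code: the function name `path_stats`, the top-1 routing assumption, and the toy routing table are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def path_stats(routes, num_experts):
    """Summarize expert-path usage.

    routes[t][l] is the expert index token t selected at layer l.
    Top-1 routing is an assumption made here for simplicity.
    """
    num_layers = len(routes[0])
    counts = Counter(tuple(r) for r in routes)  # one path per token
    possible = num_experts ** num_layers         # N^L possible paths
    top_path, top_count = counts.most_common(1)[0]
    return {
        "possible_paths": possible,
        "distinct_paths": len(counts),
        "top_path": top_path,
        "top_path_share": top_count / len(routes),
    }

# Toy data (made-up values): 4 tokens, 3 layers, 4 experts.
toy_routes = [[0, 1, 1], [0, 1, 1], [2, 0, 3], [0, 1, 1]]
stats = path_stats(toy_routes, num_experts=4)
print(stats)
```

In this toy case only 2 of the 64 possible paths are used, and a single path carries three of the four tokens, which is the kind of concentration the findings describe.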

Abstract

Sparse Mixture-of-Experts (MoE) architectures route each token through a subset of experts at each layer independently. We propose viewing MoE computation through the lens of expert paths: the sequence of expert selections a token makes across all layers. This perspective reveals that, despite $N^{L}$ possible paths for $N$ experts across $L$ layers, tokens in practice cluster into a small fraction of paths that align with linguistic function, yet the vast majority of paths remain unexplored, representing a statistical inefficiency. This motivates architectures that constrain the effective path space to amplify this natural concentration. As one instantiation, we introduce PathMoE, which shares router parameters across blocks of consecutive layers. Analysis confirms that PathMoE amplifies the emergent path structure: it produces more concentrated path clusters, better…
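
The idea of sharing router parameters across blocks of consecutive layers can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the linear router, top-1 selection, and the names `make_router`, `build_layer_routers`, `route_top1`, and `block_size` are hypothetical, and only the routers are modeled (no expert networks or hidden-state updates).

```python
import random

def make_router(num_experts, dim, rng):
    # Hypothetical linear router: one weight vector per expert.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_experts)]

def build_layer_routers(num_layers, num_experts, dim, block_size, seed=0):
    """Return one router per layer; layers in the same block share one object."""
    rng = random.Random(seed)
    num_blocks = -(-num_layers // block_size)  # ceiling division
    block_routers = [make_router(num_experts, dim, rng)
                     for _ in range(num_blocks)]
    return [block_routers[layer // block_size] for layer in range(num_layers)]

def route_top1(router, hidden):
    # Score each expert by a dot product and pick the highest-scoring one.
    scores = [sum(w * x for w, x in zip(row, hidden)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)

# block_size=3: layers 0-2 share one router, layers 3-5 share another.
shared = build_layer_routers(num_layers=6, num_experts=4, dim=8, block_size=3)
# block_size=1: every layer has its own router (independent routing).
independent = build_layer_routers(num_layers=6, num_experts=4, dim=8, block_size=1)

hidden = [0.5] * 8  # made-up hidden state
choice_layer0 = route_top1(shared[0], hidden)
choice_layer1 = route_top1(shared[1], hidden)
```

With a shared router, two layers in the same block make the same expert choice whenever they see the same hidden state, so a token's path is constrained to the extent that its representation changes slowly within a block. Setting `block_size=1` recovers fully independent per-layer routers in this sketch.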
