Path-Constrained Mixture-of-Experts
Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly

TL;DR
This paper introduces Path-Constrained Mixture-of-Experts (PathMoE), a new architecture that constrains expert paths to improve efficiency, consistency, and performance in large-scale MoE models.
Contribution
It proposes constraining expert paths in MoE models, demonstrating improved clustering, robustness, and performance without auxiliary losses.
Findings
PathMoE enhances path concentration and cross-layer consistency.
PathMoE models outperform independent routing models on perplexity and downstream tasks.
Constrained expert paths improve robustness to routing perturbations.
Abstract
Sparse Mixture-of-Experts (MoE) architectures route each token through a subset of experts at each layer independently. We propose viewing MoE computation through the lens of expert paths: the sequence of expert selections a token makes across all layers. This perspective reveals that, despite the large number of possible paths for experts across layers, tokens in practice cluster into a small fraction of paths that align with linguistic function, yet the vast majority of paths remain unexplored, representing a statistical inefficiency. This motivates architectures that constrain the effective path space to amplify this natural concentration. As one instantiation, we introduce PathMoE, which shares router parameters across blocks of consecutive layers. Analysis confirms that PathMoE amplifies the emergent path structure: it produces more concentrated path clusters, better…
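To make the notion of an expert path and of block-shared routers concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes top-1 routing, a hypothetical linear router (one weight vector per expert), and a hidden state that stays fixed across layers, so layers sharing a router always pick the same expert. In a real model the hidden state changes from layer to layer, so sharing router parameters would only encourage consistent selections rather than force them. All names (`make_router`, `build_routers`, `expert_path`, `block_size`) are illustrative.

```python
import random
from collections import Counter

def make_router(num_experts, dim, rng):
    # Hypothetical linear router: one random weight vector per expert.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]

def top1_expert(router, x):
    # Top-1 routing: choose the expert with the largest dot-product score.
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)

def build_routers(num_layers, block_size, num_experts, dim, seed=0):
    # Layers in the same block of `block_size` consecutive layers share
    # one router object; block_size=1 gives fully independent routers.
    rng = random.Random(seed)
    num_blocks = (num_layers + block_size - 1) // block_size
    block_routers = [make_router(num_experts, dim, rng) for _ in range(num_blocks)]
    return [block_routers[layer // block_size] for layer in range(num_layers)]

def expert_path(routers, x):
    # Expert path: the tuple of expert indices selected at each layer.
    # Toy assumption: the hidden state x is not updated between layers.
    return tuple(top1_expert(router, x) for router in routers)

# Compare how many distinct paths a batch of random "tokens" occupies.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]

independent = build_routers(num_layers=4, block_size=1, num_experts=4, dim=8)
shared = build_routers(num_layers=4, block_size=2, num_experts=4, dim=8)

independent_paths = Counter(expert_path(independent, t) for t in tokens)
shared_paths = Counter(expert_path(shared, t) for t in tokens)

print("distinct paths, independent routers:", len(independent_paths))
print("distinct paths, block-shared routers:", len(shared_paths))
```

Under these toy assumptions, 4 experts over 4 layers allow up to 4^4 = 256 paths with independent routers, while sharing one router per block of 2 layers caps the count at 4^2 = 16. This illustrates only the idea of a constrained path space, not the results reported in the paper.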
