TL;DR
This paper investigates how sparsity and superposition affect the interpretability and specialization of Mixture of Experts (MoE) models, revealing that increased sparsity leads to more interpretable, monosemantic expert representations without loss of performance.
Contribution
It introduces new metrics for superposition, redefines expert specialization based on feature coherence, and demonstrates that network sparsity enhances interpretability in MoE models.
Findings
Greater network sparsity correlates with increased monosemanticity.
Models with higher sparsity organize experts around coherent features.
Sparsity enables interpretability without performance loss.
Abstract
Mixture of Experts (MoE) models have become central to scaling large language models, yet their mechanistic differences from dense networks remain poorly understood. Previous work has explored how dense models use superposition to represent more features than dimensions, and how superposition is a function of feature sparsity and feature importance. MoE models cannot be explained mechanistically through the same lens. We find that neither feature sparsity nor feature importance causes discontinuous phase changes, and that network sparsity (the ratio of active to total experts) better characterizes MoEs. We develop new metrics for measuring superposition across experts. Our findings demonstrate that models with greater network sparsity exhibit greater monosemanticity. We propose a new definition of expert specialization based on monosemantic feature representation rather…
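To make the abstract's definition of network sparsity concrete, here is a minimal sketch, not the authors' code. It computes the ratio of active to total experts for a top-k routed layer. The function names `network_sparsity` and `top_k_route`, the softmax-over-selected-experts gating, and the example logits are all assumptions introduced for illustration; the paper's actual router and its superposition metrics are not described in this summary.

```python
import math


def network_sparsity(active_experts: int, total_experts: int) -> float:
    """Ratio of active to total experts, following the abstract's definition."""
    if not 0 < active_experts <= total_experts:
        raise ValueError("need 0 < active_experts <= total_experts")
    return active_experts / total_experts


def top_k_route(router_logits, k):
    """Hypothetical top-k router: pick the k largest logits, softmax over them."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


# Illustrative router scores for one input over 4 experts (made-up values).
logits = [0.1, 2.0, -1.0, 0.7]
experts, weights = top_k_route(logits, k=1)
ratio = network_sparsity(active_experts=1, total_experts=len(logits))
print(experts, weights, ratio)
```

Under this definition a smaller ratio means fewer experts fire per input, which is presumably what the abstract calls greater network sparsity; that reading is an assumption based on the summary alone.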
Peer Reviews
Decision·Submitted to ICLR 2026
* The question of how the mixture of experts architecture interacts with how concepts are represented by neurons (monosemantically vs. polysemantically) is interesting.
* The analyses of this paper look like they are probably quite interesting. I just had a very tough time reading them because the basic definitions and setup are not presented.
* Some definitions are missing, which makes the paper unclear in cases and hard to read in others. (See my questions below for examples.) The paper could greatly benefit from a clearer exposition of definitions so that readers can understand what the authors concretely mean by a "feature", or by "monosemanticity" in this context.
* The analyses are all conducted on toy models, without any analysis, e.g. of MoE models trained on real data.
- The exploration of expert specialization and initialization is interesting to me. These topics provide insights into a better understanding of MoE behaviors.
- The authors conduct extensive experiments to support their idea.
- Some of the paper is a straightforward extension of the Anthropic blog, which adapts the research on dense models to MoE models. As a result, the contribution feels somewhat limited.
- The authors' findings on toy models are interesting, but they are not entirely convincing to me due to the experimental setups. Firstly, the experiments are conducted on toy models with a very small hidden dimension (e.g., 6 or even 1). While interesting, it is hard for me to trust the conclusions drawn from suc
The approach potentially gives insight into representation and expert specialization in MoEs
I had great difficulty figuring out what was done in many parts of the paper. I don’t normally share such detailed notes as I do in the Questions section, but in this case I do to help explain how much work this paper needs. The main claim that “MoEs represent the same number of features as the dense model, but more monosemantically” (e.g., L220) seems impossible. How can two models match in the number of features they represent and the number of dimensions (“parameters”) they use, but differ i
