TL;DR
This paper argues that Mixture-of-Experts (MoE) models are easier to interpret at the expert level than dense models: sparsity pushes both individual neurons and whole experts toward monosemanticity, which makes their linguistic and semantic functions easier to identify.
Contribution
It proposes the expert, rather than the individual neuron, as the unit of analysis, and automatically interprets hundreds of experts to show that they specialize in specific linguistic and semantic tasks.
Findings
Neurons in MoE experts are less polysemantic than neurons in dense FFNs.
Sparsity encourages monosemanticity in neurons and experts.
Experts function as fine-grained task specialists rather than broad domain experts.
Abstract
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using k-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on…
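To make the probing idea in the abstract concrete, here is a minimal sketch of a sparse probe: a classifier that may read only k neurons, so high accuracy at small k indicates that a feature is concentrated in a few neurons rather than spread across many. The excerpt does not say how the paper selects neurons or trains its probes, so everything here is an assumption for illustration: the function name `k_sparse_probe`, the class-mean-difference ranking, the plain gradient-descent logistic regression, and the synthetic activations.

```python
import numpy as np

def k_sparse_probe(acts, labels, k, steps=500, lr=0.5):
    """Fit a logistic-regression probe that may read only k neurons.

    acts:   (n_samples, n_neurons) activations
    labels: (n_samples,) binary feature labels in {0, 1}
    Returns the indices of the selected neurons and the training accuracy.
    Neuron selection by class-mean difference is an assumed heuristic,
    not necessarily the selection method used in the paper.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Standardize each neuron so scores are comparable across neurons.
    z = (acts - acts.mean(axis=0)) / (acts.std(axis=0) + 1e-8)

    # Rank neurons by how far apart the two classes sit on average.
    score = np.abs(z[labels == 1].mean(axis=0) - z[labels == 0].mean(axis=0))
    selected = np.argsort(score)[::-1][:k]

    # Train logistic regression on the k selected neurons only.
    x = z[:, selected]
    w = np.zeros(k)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - labels
        w -= lr * (x.T @ grad) / len(labels)
        b -= lr * grad.mean()

    preds = (x @ w + b) > 0
    accuracy = float((preds == labels.astype(bool)).mean())
    return selected, accuracy

# Synthetic check: 16 noise neurons, with neuron 3 made to carry the label.
rng = np.random.default_rng(0)
labels = (rng.random(400) > 0.5).astype(int)
acts = rng.normal(size=(400, 16))
acts[:, 3] += 4.0 * labels

selected, accuracy = k_sparse_probe(acts, labels, k=1)
print(selected, accuracy)
```

The sketch reports training accuracy for brevity; a real comparison would score the probe on held-out data and sweep k to see how quickly accuracy saturates for each architecture.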
