Monet: Mixture of Monosemantic Experts for Transformers

Jungwoo Park; Young Jin Ahn; Kee-Eung Kim; Jaewoo Kang

arXiv:2412.04139 · cs.AI · June 12, 2025

TL;DR

Monet introduces a scalable mixture-of-experts architecture with monosemantic experts, enhancing interpretability and controllability of large language models without sacrificing performance.

Contribution

It proposes a novel expert decomposition method integrated into end-to-end training, enabling scaling to 262,144 experts per layer while maintaining efficiency.
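The expert count quoted above is a perfect square (262,144 = 512 × 512), which is what makes a pairwise decomposition attractive. The following is a minimal sketch of that arithmetic only, assuming experts are addressed by a pair of component indices on a square grid; the names `expert_id`, `component_pair`, and `SIDE` are hypothetical and are not taken from the authors' code.

```python
import math

NUM_EXPERTS = 262_144  # per-layer expert count quoted in the summary

# Assumption: experts form a square grid, so each axis holds sqrt(N) components.
SIDE = math.isqrt(NUM_EXPERTS)
assert SIDE * SIDE == NUM_EXPERTS  # 512 * 512 == 262,144


def expert_id(row: int, col: int, side: int = SIDE) -> int:
    """Map a (row, col) component pair to a flat expert index (hypothetical helper)."""
    if not (0 <= row < side and 0 <= col < side):
        raise ValueError("component index out of range")
    return row * side + col


def component_pair(expert: int, side: int = SIDE) -> tuple[int, int]:
    """Inverse of expert_id: recover the (row, col) pair from a flat index."""
    if not 0 <= expert < side * side:
        raise ValueError("expert index out of range")
    return divmod(expert, side)


# Under this assumption, stored components grow as 2 * sqrt(N),
# while the number of addressable experts grows as N.
stored_components = 2 * SIDE
```

Under this assumption, 1,024 stored components address all 262,144 experts. How the components are actually parameterized and combined is specified in the paper, not here.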

Findings

01

Experts are mutually exclusive in knowledge representation.

02

Monet enables knowledge manipulation across domains and languages.

03

Model performance remains stable despite large expert counts.

Abstract

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the…

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper tackles an interesting and important question for the field: instead of interpreting LLMs post-hoc, can we directly train them in a way that results in interpretable weights?
- This adds to existing work, such as backpack LLMs https://arxiv.org/abs/2305.16765 and codebook features https://arxiv.org/abs/2310.17230
- The proposed architecture is interesting, can (in principle) represent a large number of experts, and performs on par with the LLaMA baseline of roughly the same paramet

Weaknesses

- The lack of detailed interpretability baselines makes it difficult to evaluate the strength of the results.
- For example, the only interpretability method used as a baseline is patching reconstructions from SAEs for Gemma-2B. However, it is not reported what sparsity these SAEs achieve compared to the (effective?) sparsity of MONET. This makes it difficult to make sense of the results.
- The only relevant baseline here is using SAEs at the MLP layers, because this matches the MONET setup;

Reviewer 02 · Rating 8 · Confidence 4

Strengths

I like the idea of the paper. Some earlier works noticed that experts display some monosemanticity [1,2] and it is great to see this work push this idea. I also think that the set of experiments is very convincing and I believe that this work may be influential for getting more interpretable neural networks.
[1] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.12

Weaknesses

I think the main weakness of the paper is the presentation + writing, especially in Section 3. I am happy to consider improving my score if much better explanations of the method are given in Section 3.
- **Section 3 should be more clear (especially the Horizontal and Vertical decomposition)**: I read the work by Lample et al. [1] for completing this review and according to my understanding, there is a unique $(u_i, v_i)$ that is associated with each key. Their approach makes sense to me. --

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper presents novel decomposition methods that scale traditional MoE to 262k experts.
- The paper delivers comprehensive experimental results on the proposed model architecture.
- The proposed method achieves good expert specialization, proven under several experimental settings.

Weaknesses

- The intuition behind the architecture design is unclear.
- The explanation in the methodology section is poor and hard to understand.

Code & Models

Repositories

dmis-lab/monet · pytorch · Official

Models

Videos

Monet: Mixture of Monosemantic Experts for Transformers · slideslive

Taxonomy

Topics: linguistics and terminology studies · Linguistics, Language Diversity, and Identity · Natural Language Processing Techniques

Methods: Mixture model network