TL;DR
Monet introduces a scalable mixture-of-experts architecture with monosemantic experts, enhancing interpretability and controllability of large language models without sacrificing performance.
Contribution
It proposes a novel expert decomposition method, integrated into end-to-end training, that scales the expert count to 262,144 per layer while total parameters grow only at a square-root rate.
Findings
Experts are mutually exclusive in knowledge representation.
Monet enables knowledge manipulation across domains and languages.
Model performance remains stable despite large expert counts.
Abstract
Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the…
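To make the 262,144 figure concrete, here is a minimal arithmetic sketch. It rests on an assumption, since the abstract is truncated at this point: that the square-root scaling is in the number of experts, and that each expert is addressed by a pair of indices drawn from two equally sized groups. The function name `pairwise_expert_budget` and its output keys are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
import math


def pairwise_expert_budget(num_experts: int) -> dict:
    """Count stored components if each expert is addressed by an (i, j) pair.

    Assumption for illustration only: experts are formed by pairing one
    component from each of two equally sized groups, so num_experts must be
    a perfect square.
    """
    side = math.isqrt(num_experts)
    if side * side != num_experts:
        raise ValueError("num_experts must be a perfect square for this sketch")
    return {
        "experts": num_experts,
        "components_per_group": side,
        # Two groups of `side` components cover side * side index pairs.
        "stored_components": 2 * side,
    }


budget = pairwise_expert_budget(262_144)
print(budget)
```

Under this assumption, 262,144 experts correspond to 512 components per group, so the number of stored components grows with the square root of the expert count rather than linearly.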
Peer Reviews
Decision: ICLR 2025 Poster
- The paper tackles an interesting and important question for the field: instead of interpreting LLMs post-hoc, can we directly train them in a way that results in interpretable weights?
- This adds to existing work, such as backpack LLMs https://arxiv.org/abs/2305.16765 and codebook features https://arxiv.org/abs/2310.17230
- The proposed architecture is interesting, can (in principle) represent a large number of experts, and performs on par with the LLaMA baseline of roughly the same paramet
- The lack of detailed interpretability baselines makes it difficult to evaluate the strength of the results.
- For example, the only interpretability method used as a baseline is patching reconstructions from SAEs for Gemma-2B. However, it is not reported what sparsity these SAEs achieve compared to the (effective?) sparsity of MONET. This makes it difficult to make sense of the results.
- The only relevant baseline here is using SAEs at the MLP layers, because this matches the MONET setup;
I like the idea of the paper. Some earlier works noticed that experts display some monosemanticity [1,2] and it is great to see this work push this idea. I also think that the set of experiments is very convincing and I believe that this work may be influential for getting more interpretable neural networks. [1] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.12
I think the main weakness of the paper is the presentation + writing, especially in Section 3. I am happy to consider improving my score if much better explanations of the method are given in Section 3. - **Section 3 should be more clear (especially the Horizontal and Vertical decomposition)**: I read the work by Lample et al. [1] for completing this review and according to my understanding, there is a unique $(u_i, v_i)$ that is associated with each key. Their approach makes sense to me. --
- The paper presents novel decomposition methods that scale traditional MoE to 262k experts.
- The paper delivers comprehensive experimental results on the proposed model architecture.
- The proposed method achieves good expert specialization, proven under several experimental settings.
- The intuition behind the architecture design is unclear.
- The explanation in the methodology section is poor and hard to understand.
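One of the reviews above compares the method to the product-key approach of Lample et al., in which each key corresponds to a unique pair $(u_i, v_i)$. The sketch below illustrates that general lookup idea only and is not Monet's horizontal or vertical decomposition. The function names, the integer-valued toy vectors, and the sizes `n`, `half_dim`, and `k` are all assumptions made for illustration.

```python
import heapq
import random


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Return the k best (score, i, j) triples without scoring all n * n keys."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    # Score each half of the query against its own small set of sub-keys.
    s1 = [(dot(q1, c), i) for i, c in enumerate(sub_keys_1)]
    s2 = [(dot(q2, c), j) for j, c in enumerate(sub_keys_2)]
    top1 = heapq.nlargest(k, s1)
    top2 = heapq.nlargest(k, s2)
    # Only k * k pair sums are formed instead of len(s1) * len(s2).
    candidates = [(a + b, i, j) for a, i in top1 for b, j in top2]
    return heapq.nlargest(k, candidates)


def brute_force_topk(query, sub_keys_1, sub_keys_2, k):
    """Reference: score every concatenated key (u_i, v_j) directly."""
    scored = [
        (dot(query, c1 + c2), i, j)
        for i, c1 in enumerate(sub_keys_1)
        for j, c2 in enumerate(sub_keys_2)
    ]
    return heapq.nlargest(k, scored)


rng = random.Random(0)
n, half_dim, k = 16, 4, 5  # toy sizes: 16 * 16 = 256 implicit keys
sub_keys_1 = [[rng.randint(-5, 5) for _ in range(half_dim)] for _ in range(n)]
sub_keys_2 = [[rng.randint(-5, 5) for _ in range(half_dim)] for _ in range(n)]
query = [rng.randint(-5, 5) for _ in range(2 * half_dim)]

fast_scores = [s for s, _, _ in product_key_topk(query, sub_keys_1, sub_keys_2, k)]
slow_scores = [s for s, _, _ in brute_force_topk(query, sub_keys_1, sub_keys_2, k)]
print(fast_scores == slow_scores)
```

Only the scores are compared because tied scores can be broken differently by the two routines. Integer vectors are used so the comparison is exact rather than subject to floating-point rounding.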
Code & Models
- 🤗 MonetLLM/monet-vd-1.4B-100BT-hf · model · 12 dl · ♡ 1
- 🤗 MonetLLM/codemonet-vd-1.4B-100BT-hf · model · 4 dl · ♡ 2
- 🤗 MonetLLM/monet-hd-1.4B-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-hd-4.1B-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-hd-850M-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-vd-4.1B-100BT-hf · model · 3 dl · ♡ 2
- 🤗 MonetLLM/monet-vd-850M-100BT-hf · model · 95 dl · ♡ 2
- 🤗 MonetLLM/visionmonet-vd-1.4B-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-vd-1.4B-100BT-chat-hf · model · 4 dl · ♡ 2
Taxonomy
Topics: linguistics and terminology studies · Linguistics, Language Diversity, and Identity · Natural Language Processing Techniques
Methods: Mixture model network
