Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder
Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

TL;DR
This paper introduces two methods for multi-expert sparse autoencoders that improve feature specialization and diversity. They reduce feature redundancy and reconstruction error, making SAE-based interpretation of large language models more efficient.
Contribution
The paper proposes two innovations, Multiple Expert Activation and Feature Scaling, to address the lack of expert specialization in MoE-SAE, improving interpretability and efficiency.
Findings
24% lower reconstruction error
99% reduction in feature redundancy
Enhanced interpretability of LLMs
Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM explanations, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a critical limitation in MoE-SAE: experts often fail to specialize, which means they frequently learn overlapping or identical…
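The gated, partitioned design described in the abstract can be illustrated with a minimal sketch. This is a generic mixture-of-experts SAE forward pass written for illustration only, not the authors' implementation: the class name `MoESAE`, the softmax router, the per-expert top-k sparsity, and all sizes are assumptions, and the paper's Feature Scaling step is not modeled because its details do not appear in the text above. Setting `n_active` above 1 routes each input to several narrow experts instead of one.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def top_k_indices(values, k):
    """Indices of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class MoESAE:
    """Hypothetical MoE-SAE: a router plus several narrow encoder/decoder pairs."""

    def __init__(self, d_model, n_experts, expert_width, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.router = rand(n_experts, d_model)
        self.enc = [rand(expert_width, d_model) for _ in range(n_experts)]
        self.dec = [rand(d_model, expert_width) for _ in range(n_experts)]

    def forward(self, x, n_active=1, k_features=4):
        # Gate: score every expert, then keep only the n_active best.
        gate = softmax(matvec(self.router, x))
        active = top_k_indices(gate, n_active)
        recon = [0.0] * len(x)
        for e in active:
            # Only the selected experts are evaluated, which is where
            # the compute saving over a single wide SAE comes from.
            pre = [max(0.0, a) for a in matvec(self.enc[e], x)]  # ReLU
            keep = set(top_k_indices(pre, k_features))  # sparsity (assumed top-k)
            latent = [a if i in keep else 0.0 for i, a in enumerate(pre)]
            out = matvec(self.dec[e], latent)
            recon = [r + gate[e] * o for r, o in zip(recon, out)]
        return recon, active


sae = MoESAE(d_model=8, n_experts=4, expert_width=16, seed=0)
x = [0.5] * 8
recon, active = sae.forward(x, n_active=2, k_features=4)
print("active experts:", active)
```

In this sketch nothing stops two experts from learning the same features, since each expert is trained only on the inputs routed to it. That is the kind of overlap the abstract identifies as the limitation of MoE-SAE.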
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is well motivated, clearly written, and deeply engages with prior work. The authors correctly identify a central limitation of existing MoE-SAEs and propose two simple, conceptually coherent mechanisms to address it. Both techniques are well defined and integrated cleanly into the SAE framework. The experimental evaluation is thorough, including ablation studies that isolate the contribution of each innovation. The results convincingly demonstrate reductions in feature redundancy and i
The experiments are limited to GPT-2, which is now a dated architecture. Including results on more recent models such as Gemma or LLaMA would strengthen the empirical claims and test generality. The FLOPs-matching procedure is not clearly justified. The authors write that “to match the computational load of activating a fixed number of experts, the hidden dimension is set to 768” for dense SAEs, while Scale SAEs use a total hidden dimension of 24,576. It is unclear how this setup maintains compu
The methodology is explained with precise mathematical notation, and the results are presented with effective visualizations. The experimental evaluation is thorough and convincing. The use of FLOPS-matched comparisons, multiple datasets (in-domain and cross-domain), and a suite of complementary metrics leaves little doubt about the superiority of the proposed method. The ablation studies and mechanistic analysis are executed to a high standard.
> W1. The Mechanistic Rationale for Multiple Expert Activation Requires Deeper Justification. To be very honest, activating more than one expert in MoE is standard practice. The difference here is that we select the Top-K experts across all experts. The paper shows that activating multiple smaller experts outperforms a single larger expert under a FLOPS-matched budget (e.g., 2 experts of size 128 vs. 1 expert of size 256). However, the fundamental reason for this performance boost is not suff
- Clear writing style
- Generally readable figures
- Clearly notes the limits of the Switch SAE and how they overcome these
- Useful to see the performance on two different datasets across both the MSE and Loss Recovered metrics
- Interesting exposition when detailing how the two architecture changes help the overall performance of the SAE
- Valuable use of the signal processing literature which is a literature that is not always leveraged in interpretability research (and where interpretability
- For interpretability researchers reading, the authors might want to be careful about using the term "high frequency" without explaining what this means. In this case it seems to be mostly an analogy to the signal processing literature, but within the SAE literature a high-frequency feature is typically a feature that activates very often, which is quite a different concept. Clarifying this would be useful.
- It's also not totally clear why this analogy is a good analogy - exploring why "high-f
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods · Generative Adversarial Networks and Image Synthesis
