Evaluating Sparse Autoencoders for Monosemantic Representation
Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, A.B. Siddique

TL;DR
This paper systematically evaluates sparse autoencoders' ability to produce more interpretable, monosemantic neuron representations in large language models, introducing a new quantitative measure and a novel intervention method.
Contribution
It introduces a concept separability score based on Jensen-Shannon distance, compares SAEs with base models across multiple datasets, and proposes the APP intervention method for precise concept control.
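To make the Jensen-Shannon idea concrete, here is a minimal sketch of how the distance between one neuron's activation distributions for two concepts could be computed. It is not the paper's exact score: the histogram binning, the bin count, and the function names (`histogram`, `js_distance`, `separability`) are assumptions made for illustration. With base-2 logarithms the distance lies in [0, 1], where 0 means the two concepts produce identical activation distributions and 1 means they do not overlap at all.

```python
import math

def histogram(values, lo, hi, bins):
    """Normalized equal-width histogram of values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]

def kl_divergence(p, m):
    """KL(p || m) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding

def separability(acts_a, acts_b, bins=20):
    """JS distance between one neuron's activations on two concepts.

    acts_a, acts_b: non-empty lists of that neuron's activation values
    recorded on inputs of concept A and concept B (hypothetical inputs).
    """
    lo = min(min(acts_a), min(acts_b))
    hi = max(max(acts_a), max(acts_b))
    if hi == lo:
        return 0.0  # every activation is identical, so nothing separates
    p = histogram(acts_a, lo, hi, bins)
    q = histogram(acts_b, lo, hi, bins)
    return js_distance(p, q)
```

For example, `separability([0.0, 0.1, 0.2], [0.0, 0.1, 0.2])` returns 0.0, while a neuron that fires only for one of the two concepts scores close to 1. A per-neuron score over many concepts could then be built by aggregating such pairwise distances, though how the paper aggregates them is not stated in this summary.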
Findings
SAEs reduce polysemanticity and improve concept separability.
SAEs enable more precise concept-level control with partial suppression.
The APP method achieves effective concept removal with minimal perplexity increase.
Abstract
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through an activation-distribution lens. We introduce a fine-grained concept separability score based on the Jensen-Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
