TL;DR
This paper introduces an evaluation framework for sparse autoencoders (SAEs) that tests whether they produce interpretable, monosemantic features, using polysemous words as the test case. The results shed light on how SAEs work internally and where they fall short.
Contribution
The paper proposes a suite of semantic evaluation metrics for SAEs, highlights the strengths and weaknesses of SAEs in capturing polysemy, and offers insights into how LLMs distinguish multiple meanings of a single word.
Findings
SAEs optimized for the MSE-L0 Pareto frontier may reduce interpretability.
Deeper layers and Attention modules help distinguish polysemy.
Semantic evaluation reveals limitations of traditional SAE metrics.
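For readers unfamiliar with the two traditional metrics named in these findings, here is a minimal sketch of how reconstruction MSE and L0 sparsity are typically computed for an SAE. It assumes a single-hidden-layer ReLU autoencoder with random toy weights; the function names (`sae_forward`, `mse`, `l0`) and all shapes are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed SAE form: ReLU encoder followed by a linear decoder."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # sparse feature activations
    x_hat = f @ W_dec + b_dec               # reconstructed activations
    return f, x_hat

def mse(x, x_hat):
    """Mean squared reconstruction error over all inputs and dimensions."""
    return float(np.mean((x - x_hat) ** 2))

def l0(f):
    """Average number of non-zero (active) features per input."""
    return float(np.mean(np.sum(f > 0, axis=1)))

# Toy data standing in for LLM activations (illustrative values only).
rng = np.random.default_rng(0)
n, d_model, d_sae = 32, 16, 64
x = rng.normal(size=(n, d_model))
W_enc = 0.1 * rng.normal(size=(d_model, d_sae))
b_enc = -0.2 * np.ones(d_sae)  # negative bias pushes more features to zero
W_dec = 0.1 * rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print("MSE:", mse(x, x_hat))
print("L0: ", l0(f))
```

Both numbers describe reconstruction fidelity and sparsity only. Neither indicates whether a given feature separates the different senses of a polysemous word, which is the gap the semantic evaluation targets.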
Abstract
Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs) by mapping the complex superposition of polysemantic neurons into monosemantic features and composing a sparse dictionary of words. However, traditional performance metrics like Mean Squared Error and L0 sparsity ignore the evaluation of the semantic representational power of SAEs -- whether they can acquire interpretable monosemantic features while preserving the semantic relationship of words. For instance, it is not obvious whether a learned sparse feature could distinguish different meanings in one word. In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may confuse…
Taxonomy
Methods: Softmax · Attention Is All You Need
