Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Gouki Minegishi; Hiroki Furuta; Yusuke Iwasawa; Yutaka Matsuo

arXiv:2501.06254·cs.CL·February 19, 2025

TL;DR

This paper introduces a new evaluation framework for sparse autoencoders that assesses their ability to produce interpretable monosemantic features, especially for polysemous words, revealing insights into their internal mechanisms and limitations.

Contribution

It proposes a suite of semantic evaluation metrics for SAEs, highlighting their strengths and weaknesses in capturing polysemy, and offers insights into how LLMs distinguish multiple meanings within words.

Findings

01

SAEs developed to improve the MSE-L0 Pareto frontier may lose interpretability.

02

Deeper layers and Attention modules help distinguish polysemy.

03

Semantic evaluation reveals limitations of traditional SAE metrics.

Abstract

Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs) by mapping the complex superposition of polysemantic neurons into monosemantic features and composing a sparse dictionary of words. However, traditional performance metrics like Mean Squared Error and L0 sparsity ignore the evaluation of the semantic representational power of SAEs -- whether they can acquire interpretable monosemantic features while preserving the semantic relationship of words. For instance, it is not obvious whether a learned sparse feature could distinguish different meanings in one word. In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may confuse…
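
To make the two traditional metrics named in the abstract concrete, here is a minimal sketch of how Mean Squared Error and L0 sparsity could be computed for a single activation vector. It assumes a plain ReLU autoencoder; the weights, the input, and every function name are hypothetical and chosen for illustration, not taken from the paper or its repository.

```python
# Illustrative sketch only: a tiny ReLU sparse autoencoder with hand-picked weights.
# All names and numbers are hypothetical, not from the paper or its repository.

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into sparse features f, then decode f into a reconstruction."""
    f = relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])
    x_hat = [a + b for a, b in zip(matvec(w_dec, f), b_dec)]
    return f, x_hat

def mse(x, x_hat):
    """Mean squared reconstruction error over the activation dimensions."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def l0(f):
    """Number of active (non-zero) features."""
    return sum(1 for a in f if a != 0.0)

# A 2-dimensional activation expanded into a 4-feature dictionary.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_DEC = [0.0, 0.0]

x = [0.5, -2.0]
f, x_hat = sae_forward(x, W_ENC, B_ENC, W_DEC, B_DEC)
print("features:", f)
print("MSE:", mse(x, x_hat), "L0:", l0(f))
```

Both numbers are computed from the reconstruction and the activation pattern alone. Neither one checks what an active feature means, which is the gap the abstract points to: a low MSE at a low L0 does not by itself show that a feature separates the different senses of a polysemous word.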

Code & Models

Repositories

gouki510/ps-eval · jax · Official

Taxonomy

Methods: Softmax · Attention Is All You Need