Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

Mengyu Ye; Jun Suzuki; Tatsuro Inaba; Tatsuki Kuribayashi

arXiv:2510.22332·cs.LG·October 28, 2025

TL;DR

This paper compares the interpretability of features learned by sparse autoencoders (SAEs) with that of features stored in the feed-forward (FF) layers of large language models, finding that FF features are nearly as interpretable as SAE features and, in some aspects, more interpretable.

Contribution

The study systematically evaluates and compares the interpretability of features in FF layers and SAEs, revealing FFs as a strong baseline and questioning the added value of SAEs.

Findings

01 FFs and SAEs have similar interpretability ranges

02 SAEs show minimal but observable interpretability improvements

03 FFs sometimes outperform SAEs in interpretability

Abstract

Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery approach that relies on proxy modules; the quality of the features learned by, e.g., sparse autoencoders (SAEs) is then evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, viewing FF layers as key-value memories, using modern interpretability benchmarks. Our extensive evaluation revealed that SAEs and FFs exhibit a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Furthermore, in certain…
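
The key-value memory view of an FF layer mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain two-matrix FF layer with a ReLU activation and no bias or gating; the names `K`, `V`, `ff_matrix_form`, and `ff_memory_form` are illustrative and not taken from the paper, and the layers in the evaluated models may use a different activation or a gated variant. Under these assumptions, each row of `K` acts as a key matched against the input, and the corresponding row of `V` is the value (feature vector) written to the output, weighted by the key's activation.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def ff_matrix_form(x, K, V):
    """Usual FF computation: act(x K^T) V, with K and V of shape (d_ff, d_model)."""
    return relu(x @ K.T) @ V


def ff_memory_form(x, K, V):
    """Same computation written as a weighted sum over key-value memories."""
    coeffs = relu(K @ x)  # one activation per memory slot (key match strength)
    return sum(c * v for c, v in zip(coeffs, V))


# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
K = rng.normal(size=(d_ff, d_model))  # keys: one row per hidden neuron
V = rng.normal(size=(d_ff, d_model))  # values: one row per hidden neuron
x = rng.normal(size=d_model)          # a single token representation

out_matrix = ff_matrix_form(x, K, V)
out_memory = ff_memory_form(x, K, V)

# Indices of the three most strongly activated memories for this input.
activations = relu(K @ x)
top_memories = np.argsort(-activations)[:3]
```

In this reading, the per-neuron activations play a role analogous to SAE feature activations: for a given input, one can ask which memories fire most strongly (`top_memories` above) and inspect the associated value vectors, without training a separate proxy module.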
