Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders
Mengyu Ye, Jun Suzuki, Tatsuro Inaba, Tatsuki Kuribayashi

TL;DR
This paper compares the interpretability of features learned by sparse autoencoders (SAEs) with that of features stored in the feed-forward (FF) layers of large language models, finding that FF features are nearly as interpretable as SAE features and sometimes better in certain aspects.
Contribution
The study systematically evaluates and compares the interpretability of features in FF layers and SAEs, revealing FFs as a strong baseline and questioning the added value of SAEs.
Findings
FFs and SAEs have similar interpretability ranges
SAEs show minimal but observable interpretability improvements
FFs sometimes outperform SAEs in interpretability
Abstract
Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery approach with the help of proxy modules. Then, the quality of features learned by, e.g., sparse auto-encoders (SAEs), is evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, given the perspective of FF as key-value memories, with modern interpretability benchmarks. Our extensive evaluation revealed that SAEs and FFs exhibit a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Furthermore, in certain…
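The "FF as key-value memories" perspective mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain two-matrix FF layer without biases or gating, where each hidden unit is a memory slot with a key vector (a row of the input projection) and a value vector (a row of the output projection). The GELU activation, the toy dimensions, and every function name here are illustrative assumptions, not the paper's code or evaluation setup.

```python
import math
import random


def gelu(z: float) -> float:
    """Tanh approximation of GELU; the activation choice is an assumption."""
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def ff_activations(x, keys):
    """One activation per memory slot: how strongly x matches each key."""
    return [gelu(dot(x, k)) for k in keys]


def ff_forward(x, keys, values):
    """FF output written as an activation-weighted sum of value vectors."""
    acts = ff_activations(x, keys)
    out = [0.0] * len(values[0])
    for a, v in zip(acts, values):
        for j, v_j in enumerate(v):
            out[j] += a * v_j
    return out


def top_slots(x, keys, n=3):
    """Indices of the n most active slots, i.e. candidate features to inspect."""
    acts = ff_activations(x, keys)
    return sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:n]


# Toy example with random weights (hypothetical sizes, not a real model).
rng = random.Random(0)
d_model, d_ff = 4, 8
keys = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
values = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [rng.gauss(0, 1) for _ in range(d_model)]

y = ff_forward(x, keys, values)
print("output dimension:", len(y))
print("most active slots:", top_slots(x, keys))
```

Under this view, each slot's activation plays a role comparable to an SAE feature activation, which is what makes a side-by-side interpretability comparison possible. Gated FF variants used in many recent models would need an extra gating term that this sketch omits.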
