TL;DR
This paper introduces a semi-nonnegative matrix factorization method to extract interpretable, sparse features from language model activations, improving causal interpretability and revealing hierarchical concept structures.
Contribution
The paper proposes an SNMF-based approach to unsupervised feature extraction that decomposes MLP activations directly and outperforms existing methods such as SAEs in interpretability and causal evaluations.
Findings
SNMF features outperform SAEs in causal steering tasks.
Features align with human-interpretable concepts.
Related features reuse the same neuron combinations, which suggests a hierarchical concept structure.
Abstract
A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such…
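To make the decomposition concrete, here is a minimal sketch of semi-nonnegative matrix factorization using the classical multiplicative-update scheme (Ding, Li, and Jordan). It is an illustration under assumptions, not the paper's training procedure: the function name `semi_nmf`, the hyperparameters, and the random matrix standing in for MLP activations are all hypothetical. The idea is to approximate a mixed-sign matrix X as F Gᵀ, where the feature directions F are unconstrained and the per-sample coefficients G are nonnegative.

```python
import numpy as np


def _pos(A):
    # Positive part of a matrix: max(A, 0) elementwise.
    return (np.abs(A) + A) / 2


def _neg(A):
    # Negative part of a matrix: max(-A, 0) elementwise.
    return (np.abs(A) - A) / 2


def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (d x n, mixed sign) as F @ G.T with G >= 0.

    F (d x k) holds unconstrained feature directions.
    G (n x k) holds nonnegative coefficients, one row per sample.
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = rng.random((n, k)) + 0.1  # strictly positive initialization

    for _ in range(n_iter):
        # Least-squares optimal F for the current G.
        F = X @ G @ np.linalg.pinv(G.T @ G)

        # Multiplicative update keeps G nonnegative.
        XtF = X.T @ F
        FtF = F.T @ F
        num = _pos(XtF) + G @ _neg(FtF)
        den = _neg(XtF) + G @ _pos(FtF)
        G *= np.sqrt((num + eps) / (den + eps))

    # Refit F so it is optimal for the final G.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    return F, G


# Synthetic stand-in for activations: 16 "neurons", 200 "tokens" (assumed data).
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 200))
F, G = semi_nmf(X, k=4)
print("relative error:", np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))
```

Because only G is constrained, each column of F can mix neurons with positive and negative weights, while each sample is described by a purely additive combination of features.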
