Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky, Philippe Chlenski, Neel Nanda

TL;DR
This paper introduces transcoders, which approximate a transformer's densely activating MLP layers with wider, sparsely activating ones, making fine-grained circuit analysis through MLP sublayers tractable.
Contribution
The paper presents a new approach using transcoders for weights-based circuit analysis, improving interpretability of MLP sublayers in language models.
Findings
Transcoders perform comparably to SAEs in sparsity and interpretability.
Transcoders were successfully trained on language models with up to 1.4B parameters.
Circuit analysis with transcoders revealed new insights into the 'greater-than circuit' in GPT2-small.
Abstract
A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this, we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize…
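The idea of replacing a dense MLP layer with a wider, sparsely activating one can be illustrated with a minimal sketch. It assumes a one-hidden-layer ReLU transcoder trained to match the original MLP's output, with an L1 penalty on its hidden activations to encourage sparsity. The names `W_enc`, `b_enc`, `W_dec`, `b_dec`, and `l1_coeff`, and the exact form of the loss, are illustrative assumptions rather than details taken from this summary.

```python
# Minimal pure-Python sketch of a transcoder forward pass and training loss.
# Assumption: the transcoder is a ReLU MLP whose hidden layer (the "features")
# is wider than the MLP it approximates, and whose loss is
# squared reconstruction error + l1_coeff * L1 norm of the feature activations.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def add(u, v):
    """Elementwise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, a) for a in v]

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Map the MLP's input x to sparse features z and a predicted MLP output."""
    z = relu(add(matvec(W_enc, x), b_enc))   # sparse feature activations
    y_hat = add(matvec(W_dec, z), b_dec)     # approximation of the MLP output
    return z, y_hat

def transcoder_loss(y_hat, mlp_out, z, l1_coeff):
    """Faithfulness term plus sparsity penalty (assumed form)."""
    mse = sum((a - b) ** 2 for a, b in zip(y_hat, mlp_out))
    l1 = sum(abs(a) for a in z)
    return mse + l1_coeff * l1

# Toy example: 2-dimensional residual stream, 3 transcoder features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.5]

x = [1.0, -2.0]        # input to the MLP sublayer
mlp_out = [1.0, 0.0]   # hypothetical output of the original dense MLP on x

z, y_hat = transcoder_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = transcoder_loss(y_hat, mlp_out, z, l1_coeff=0.1)
```

Unlike a sparse autoencoder, which reconstructs its own input, the target here is the MLP's output given the MLP's input, which is what lets the sparse layer stand in for the dense one during circuit analysis.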
Taxonomy
Topics: Natural Language Processing Techniques
