Transcoders Find Interpretable LLM Feature Circuits

Jacob Dunefsky; Philippe Chlenski; Neel Nanda

arXiv:2406.11944 · cs.LG · November 8, 2024 · 6 cites

TL;DR

This paper explores transcoders, which approximate a densely activating MLP layer with a wider, sparsely activating one, and uses them to perform circuit analysis through the MLP sublayers of transformer language models.

Contribution

The paper presents a new approach using transcoders for weights-based circuit analysis, improving interpretability of MLP sublayers in language models.

Findings

01. Transcoders perform comparably to SAEs in sparsity and interpretability.

02. Transcoders were trained successfully on language models of up to 1.4B parameters.

03. Transcoder-based circuit analysis revealed new insights into the "greater-than circuit" in GPT2-small.

Abstract

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this, we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize…
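The idea of approximating a dense MLP layer with a wider, sparsely activating one can be sketched roughly as follows. This is a minimal illustration in dependency-free Python, not the authors' implementation: the function names, the tiny dimensions, the ReLU stand-in for the original MLP's nonlinearity, the squared-error-plus-L1 objective, and the `l1_coeff` value are all assumptions made for the example.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    # W is a list of rows; each row has len(v) entries.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def dense_mlp(x, W_in, b_in, W_out, b_out):
    # Hypothetical stand-in for the original MLP sublayer being approximated.
    return add(matvec(W_out, relu(add(matvec(W_in, x), b_in))), b_out)

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    # Encode the MLP *input* into a wide, nonnegative feature vector z,
    # then decode z into a prediction of the MLP *output*.
    z = relu(add(matvec(W_enc, x), b_enc))
    y_hat = add(matvec(W_dec, z), b_dec)
    return z, y_hat

def transcoder_loss(x, mlp_out, params, l1_coeff):
    # Faithfulness term (squared error against the real MLP output)
    # plus an L1 penalty that pushes feature activations toward sparsity.
    z, y_hat = transcoder_forward(x, *params)
    sq_err = sum((a - b) ** 2 for a, b in zip(y_hat, mlp_out))
    l1 = sum(abs(a) for a in z)
    return sq_err + l1_coeff * l1

# Illustrative sizes only: the feature layer is wider than the MLP hidden layer.
d_model, d_mlp, d_features = 4, 8, 16
rng = random.Random(0)

mlp_params = (rand_matrix(d_mlp, d_model, rng), [0.0] * d_mlp,
              rand_matrix(d_model, d_mlp, rng), [0.0] * d_model)
params = (rand_matrix(d_features, d_model, rng), [0.0] * d_features,
          rand_matrix(d_model, d_features, rng), [0.0] * d_model)

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
mlp_out = dense_mlp(x, *mlp_params)
z, y_hat = transcoder_forward(x, *params)
loss = transcoder_loss(x, mlp_out, params, l1_coeff=1e-3)
loss_no_l1 = transcoder_loss(x, mlp_out, params, l1_coeff=0.0)
```

The point of the sketch is the training target: unlike an autoencoder, which reconstructs its own input, the decoder here is scored against the output of the original MLP for the same input. The weights above are random and untrained, so `z` is not meaningfully sparse; training would minimize `transcoder_loss` over many inputs.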

Code & Models

Repositories

jacobdunefsky/transcoder_circuits · jax · Official

Videos

Transcoders find interpretable LLM feature circuits · slideslive

Taxonomy

Topics: Natural Language Processing Techniques