Partially Rewriting a Transformer in Natural Language

Gonçalo Paulo, Nora Belrose

arXiv:2501.18838·cs.LG·February 3, 2025

TL;DR

This paper attempts to partially rewrite a large language model in natural language: one feedforward network is approximated by a sparse transcoder, its neurons are explained automatically, and an LLM-based simulator then predicts neuron activations from those explanations. The effect on the model's final output is measured.

Contribution

It introduces a novel interpretability pipeline that replaces parts of a language model with natural language explanations and simulators, advancing understanding of model internals.

Findings

01 Replacing the first layer of the sparse MLP with an explanation-based simulator increases the model's loss, so the rewrite is not lossless.

02 More detailed explanations are needed for better performance.

03 The distortion introduced by the modifications is quantified through their effect on the model's final output.

Abstract

The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to partially rewrite a large language model using simple natural language explanations. We first approximate one of the feedforward networks in the LLM with a wider MLP with sparsely activating neurons - a transcoder - and use an automated interpretability pipeline to generate explanations for these neurons. We then replace the first layer of this sparse MLP with an LLM-based simulator, which predicts the activation of each neuron given its explanation and the surrounding context. Finally, we measure the degree to which these modifications distort the model's final output. With our pipeline, the model's increase in loss is statistically similar to entirely…
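The pipeline in the abstract can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the weights are random rather than trained, top-k sparsity is an assumed choice of sparse activation, the `simulate` callables stand in for the LLM-based simulator (which in the paper reads explanations and context), and mean squared error on the layer output stands in for the change in model loss. All names (`first_layer`, `second_layer`, `distortion`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, k = 8, 32, 4  # hypothetical sizes, chosen for illustration

# Random stand-ins for transcoder weights; a real transcoder is trained
# to reproduce the output of the original feedforward network.
W_enc = rng.normal(size=(d_model, n_latents))
W_dec = rng.normal(size=(n_latents, d_model))


def topk_relu(pre, k):
    """Apply ReLU, then keep only the k largest activations in each row."""
    acts = np.maximum(pre, 0.0)
    idx = np.argpartition(acts, -k, axis=-1)[..., -k:]
    sparse = np.zeros_like(acts)
    np.put_along_axis(sparse, idx, np.take_along_axis(acts, idx, axis=-1), axis=-1)
    return sparse


def first_layer(x):
    """Sparse MLP first layer: the part the simulator replaces."""
    return topk_relu(x @ W_enc, k)


def second_layer(acts):
    """Sparse MLP second layer: kept unchanged in the rewritten model."""
    return acts @ W_dec


def distortion(x, simulate):
    """MSE between the original sparse MLP output and the rewritten one.

    `simulate` maps inputs to predicted neuron activations; here it is a
    plain function, whereas the paper uses an LLM reading explanations.
    """
    reference = second_layer(first_layer(x))
    rewritten = second_layer(simulate(x))
    return float(np.mean((rewritten - reference) ** 2))


def perfect_simulator(x):
    # An idealised simulator that predicts every activation exactly.
    return first_layer(x)


def noisy_simulator(x):
    # A simulator whose predictions are corrupted by noise (assumed noise scale).
    noise = 0.1 * np.random.default_rng(1).normal(size=(len(x), n_latents))
    return np.maximum(first_layer(x) + noise, 0.0)


def zero_simulator(x):
    # Baseline that predicts no activation at all.
    return np.zeros((len(x), n_latents))


x = rng.normal(size=(16, d_model))
for name, sim in [("perfect", perfect_simulator),
                  ("noisy", noisy_simulator),
                  ("zero", zero_simulator)]:
    print(name, distortion(x, sim))
```

Comparing a simulator against the all-zero baseline is one simple way to judge whether its predictions carry useful information about the activations; the toy distortion values printed here say nothing about the paper's actual results.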

Code & Models

Repositories

eleutherai/sae-auto-interp (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques