Data-driven Circuit Discovery for Interpretability of Language Models
Daking Rai, Mor Geva, Ziyu Yao

TL;DR
This paper introduces Data-driven Circuit Discovery (DCD), a framework that uncovers multiple, more faithful circuits within a language model by clustering examples according to how the model processes them, challenging the assumption of a single circuit per task.
Contribution
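DCD drops two assumptions made by existing methods: that the LM implements a task with a single circuit, and that a human-chosen dataset adequately represents that task. This enables discovery of multiple, more faithful circuits, each aligned with a group of examples that the model processes in a similar way.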
Findings
DCD discovers multiple circuits per dataset, each more faithful to its group.
Existing methods often find dataset-specific circuits rather than general task circuits.
DCD reveals mechanistic structure within language models beyond human-defined task boundaries.
Abstract
Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM's behavior. Existing circuit discovery methods are hypothesis-driven; they first informally define a task with a dataset, and then apply a circuit discovery algorithm over that dataset to obtain a single circuit. This imposes two strong assumptions: that the LM implements the task with a single circuit, and that the dataset adequately represents the task as humans understand it. We systematically test these assumptions across four previously studied tasks and find that even minor dataset variations that preserve task semantics can produce circuits with low edge overlap and cross-dataset faithfulness. More strikingly, when applied to a mixed dataset with two distinct tasks whose separately discovered circuits have…
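The abstract compares circuits by their edge overlap. As a minimal sketch of what such a comparison could look like, the snippet below treats a circuit as a set of directed edges and computes the Jaccard overlap (intersection over union). The choice of Jaccard, the `edge_overlap` helper, and the node names are assumptions made for illustration; the summary does not state which overlap measure the paper uses.

```python
from typing import Iterable, Tuple

# An edge is (source node, destination node). Node names are hypothetical.
Edge = Tuple[str, str]


def edge_overlap(circuit_a: Iterable[Edge], circuit_b: Iterable[Edge]) -> float:
    """Jaccard overlap (intersection over union) between two edge sets.

    Assumption: the paper's overlap measure may be defined differently.
    """
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    if not union:
        return 1.0  # two empty circuits are treated as identical
    return len(a & b) / len(union)


# Two toy circuits that share a single edge out of five distinct edges.
circuit_1 = {("a0.h1", "m1"), ("m1", "a2.h3"), ("a2.h3", "logits")}
circuit_2 = {("a0.h1", "m1"), ("m1", "a3.h0"), ("a3.h0", "logits")}

overlap = edge_overlap(circuit_1, circuit_2)
print(f"edge overlap: {overlap:.2f}")
```

A low value for two circuits discovered on datasets that share task semantics would correspond to the kind of dataset sensitivity the abstract describes.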
