Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf, Sondre Wold, Barbara Plank

TL;DR
This paper investigates the modular structure of transformer-based language models by identifying and analyzing circuits for compositional subtasks, revealing their reusability and potential for representing complex functions.
Contribution
It introduces a method to identify and compare circuits for modular subtasks, demonstrating their overlap, faithfulness, and compositional reuse within language models.
Findings
Circuits for similar tasks show significant node overlap.
Identified circuits are faithful to task behavior.
Circuits can be combined to model complex functions.
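To make the overlap and combination findings concrete, here is a minimal sketch that treats each circuit as a set of node names. It is an illustration under assumptions, not the paper's implementation: the intersection-over-union score, the set-union composition, the function names, and the node labels such as "L0.H1" are all hypothetical choices made for this example.

```python
from typing import Set


def node_overlap_iou(a: Set[str], b: Set[str]) -> float:
    """Intersection over union of two circuits' node sets (assumed metric)."""
    union = a | b
    if not union:
        # Two empty circuits share nothing; define their overlap as 0.
        return 0.0
    return len(a & b) / len(union)


def compose_union(a: Set[str], b: Set[str]) -> Set[str]:
    """Combine two circuits by taking the union of their nodes (assumed operator)."""
    return a | b


# Hypothetical circuits for two subtasks; node names are made up.
circuit_a = {"L0.H1", "L1.MLP", "L2.H3"}
circuit_b = {"L0.H1", "L1.MLP", "L3.H0"}

overlap = node_overlap_iou(circuit_a, circuit_b)
combined = compose_union(circuit_a, circuit_b)

print(f"node overlap (IoU): {overlap:.2f}")
print(f"combined circuit size: {len(combined)}")
```

In this toy case the two circuits share two of four distinct nodes. Whether a combined circuit actually reproduces the composite behavior would still have to be checked against the model.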
Abstract
A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying circuits, which represent the minimal computational subgraphs responsible for a model's behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our…
Taxonomy
Topics: Model-Driven Software Engineering Techniques
Methods: Sparse Evolutionary Training · Focus
