Decomposition of Small Transformer Models
Casper L. Christensen, Logan Riggs

TL;DR
This paper extends the Stochastic Parameter Decomposition method to Transformer models, enabling the identification of interpretable components and mechanisms within complex neural networks like GPT-2-small.
Contribution
The paper introduces an updated causal importance function and a new loss function for SPD, and applies the method to both a toy induction-head model and GPT-2-small.
Findings
SPD can decompose toy induction-head models successfully.
SPD locates interpretable concepts in GPT-2-small.
These results are a first step towards applying SPD to modern models.
Abstract
Recent work in mechanistic interpretability has shown that decomposing models in parameter space may yield clean handles for analysis and intervention. Previous methods have demonstrated successful applications on a wide range of toy models, but the gap to "real models" has not yet been bridged. In this work, we extend Stochastic Parameter Decomposition (SPD) to Transformer models, proposing an updated causal importance function suited for sequential data and a new loss function. We demonstrate that SPD can successfully decompose a toy induction-head model and recover the expected 2-step circuit. We also show that applying SPD to GPT-2-small can successfully locate subcomponents corresponding to interpretable concepts like "golf" and "basketball". These results take the first step in the direction of extending SPD to modern models, and show that we can use the method to surface…
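To make the induction task concrete, the following minimal Python sketch states the input-output behaviour an induction circuit is usually described as implementing: when the current token has appeared earlier in the sequence, predict the token that followed it. This is not the paper's model or the SPD method. The function name `induction_prediction` and the two-step framing in the comments (previous-token lookup, then match-and-copy) are assumptions made for illustration, based on the common description of induction heads.

```python
from typing import Optional, Sequence


def induction_prediction(tokens: Sequence[int]) -> Optional[int]:
    """Reference behaviour for the induction task (illustrative only).

    Returns the token that followed the most recent earlier occurrence of
    the final token, or None if the final token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Step 1 (previous-token information): position i "knows" tokens[i - 1].
    # Step 2 (match and copy): find the latest position i whose previous
    # token equals the current token, and copy tokens[i] as the prediction.
    for i in range(len(tokens) - 1, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None


if __name__ == "__main__":
    # Sequence "... 5 7 ... 5": the earlier 5 was followed by 7.
    print(induction_prediction([5, 7, 9, 5]))
```

A decomposition that recovers the expected circuit would, under this framing, isolate parameter components responsible for each of the two steps separately.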
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis
