How to use and interpret activation patching
Stefan Heimersheim, Neel Nanda

TL;DR
This paper offers practical guidance on activation patching, a popular mechanistic interpretability technique for neural networks. It covers the different ways to apply patching, how to interpret the results, and common pitfalls, with a focus on what patching experiments reveal about circuits.
Contribution
It gives an overview of the ways activation patching can be applied, explains how to interpret the results, and discusses common pitfalls, drawing on the authors' practical experience.
Findings
The different ways to apply activation patching are summarized.
Guidance is given on interpreting patching results, with a focus on what evidence they provide about circuits.
The choice of metric and its associated pitfalls are discussed, along with best practices.
Abstract
Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
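To make the basic idea concrete, here is a minimal sketch of activation patching on a toy network. It is not taken from the paper: the hand-written weights, the two example inputs, and the names `hidden`, `logit_diff`, and `patch_neurons` are all assumptions made for illustration. The sketch caches hidden activations from a "clean" input, copies them one neuron at a time into a run on a "corrupted" input, and reports how much of the clean-vs-corrupted logit difference each patch restores. Patching in the opposite direction, and using other metrics, are equally possible.

```python
# Toy illustration only: a hand-written 2-layer ReLU network, not a real model.
# All weights and inputs below are hypothetical values chosen for easy arithmetic.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # 3 hidden neurons x 2 inputs
W2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]     # 2 output logits x 3 hidden neurons


def hidden(x):
    """Hidden-layer activations (ReLU of a linear map)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def logit_diff(h):
    """Metric: logit 0 minus logit 1, computed from hidden activations."""
    out = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return out[0] - out[1]


def patch_neurons(clean_x, corrupt_x):
    """Patch each clean hidden activation into the corrupted run, one at a time.

    Returns the clean metric, the corrupted metric, and per-neuron scores
    normalized so that 0 means "no change" and 1 means "clean behavior restored".
    """
    clean_h = hidden(clean_x)      # cached activations from the clean run
    corrupt_h = hidden(corrupt_x)  # baseline activations from the corrupted run
    clean_metric = logit_diff(clean_h)
    corrupt_metric = logit_diff(corrupt_h)
    gap = clean_metric - corrupt_metric
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same metric")

    scores = []
    for i in range(len(corrupt_h)):
        patched_h = list(corrupt_h)
        patched_h[i] = clean_h[i]  # overwrite a single activation
        scores.append((logit_diff(patched_h) - corrupt_metric) / gap)
    return clean_metric, corrupt_metric, scores


if __name__ == "__main__":
    clean, corrupt, scores = patch_neurons([2.0, 0.0], [0.0, 2.0])
    print("clean metric:", clean, "corrupted metric:", corrupt)
    for i, s in enumerate(scores):
        print(f"neuron {i}: fraction of metric restored = {s:.2f}")
```

In this toy setup, a neuron whose activation is the same on both inputs scores zero however much the output depends on it, which is one reason the choice of clean and corrupted inputs shapes what a patching experiment can show.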
Taxonomy
Topics: Usability and User Interface Design · Business Process Modeling and Analysis · Intelligent Tutoring Systems and Adaptive Learning
Methods: Activation Patching · Focus
