Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller, Bilal Chughtai, William Saunders

TL;DR
This paper critically examines the robustness of circuit faithfulness metrics in neural network interpretability, revealing their high sensitivity to methodological variations and emphasizing the need for clearer interpretability claims.
Contribution
The authors survey existing circuit faithfulness metrics, demonstrate their sensitivity to experimental choices, and release an open-source library intended to make circuit analysis more reliable.
Findings
Existing metrics are highly sensitive to ablation methodology.
Circuit faithfulness scores depend on experimental setup, not just circuit components.
Clearer standards are needed for interpreting neural network circuits.
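The first two findings can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the paper or its library: the four-component additive "model", the component names, the dataset, and the `faithfulness_gap` metric (mean absolute difference between circuit output and full-model output, lower is better) are assumptions chosen only to show that the same circuit can receive different scores under zero ablation and mean ablation.

```python
from statistics import mean

# Hypothetical toy "model": its output is the sum of four component outputs.
# Real transformer components (heads, MLPs, edges) are far more complex.
COMPONENTS = {
    "a": lambda x: 2.0 * x,
    "b": lambda x: 1.0 * x,
    "c": lambda x: 0.5 * x + 1.0,    # constant offset, so its mean is far from zero
    "d": lambda x: -0.25 * x + 0.5,
}

DATASET = [1.0, 2.0, 3.0, 4.0]  # illustrative inputs, not real task data


def run(x, circuit, ablation, component_means):
    """Run the toy model, ablating every component outside `circuit`."""
    total = 0.0
    for name, fn in COMPONENTS.items():
        if name in circuit:
            total += fn(x)                    # component kept: use its real output
        elif ablation == "mean":
            total += component_means[name]    # replace with its dataset mean
        elif ablation == "zero":
            total += 0.0                      # replace with zero
        else:
            raise ValueError(f"unknown ablation type: {ablation}")
    return total


def faithfulness_gap(dataset, circuit, ablation):
    """Mean absolute gap between circuit and full-model outputs (0 = identical)."""
    component_means = {
        name: mean(fn(x) for x in dataset) for name, fn in COMPONENTS.items()
    }
    all_components = set(COMPONENTS)
    gaps = [
        abs(run(x, all_components, ablation, component_means)
            - run(x, circuit, ablation, component_means))
        for x in dataset
    ]
    return mean(gaps)


if __name__ == "__main__":
    circuit = {"a", "b"}
    print("zero ablation gap:", faithfulness_gap(DATASET, circuit, "zero"))  # 2.125
    print("mean ablation gap:", faithfulness_gap(DATASET, circuit, "mean"))  # 0.25
```

In this sketch the circuit `{"a", "b"}` looks much more faithful under mean ablation than under zero ablation, even though the circuit itself is unchanged. The score reflects the ablation choice as well as the circuit's components.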
Abstract
Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific tasks. But how do we measure the performance of such circuits? Prior work has attempted to measure circuit 'faithfulness' -- the degree to which the circuit replicates the performance of the full model. In this work, we survey many considerations for designing experiments that measure circuit faithfulness by ablating portions of the model's computation. Concerningly, we find existing methods are highly sensitive to seemingly insignificant changes in the ablation methodology. We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit - the task a circuit is…
