Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
Li Puyin, Jiyuan Tan, Ahmad Jabbar, Thomas Icard, Atticus Geiger

TL;DR
This paper introduces a diagnostic method for neural network interpretability that partitions the input space to identify where a high-level causal hypothesis holds and where it fails, which in turn helps refine the interpretation.
Contribution
It proposes a novel input space partitioning approach for diagnosing and improving causal abstraction-based interpretations in neural networks.
Findings
Partitioning the input space reveals the regions where an interpretation holds and the regions where it fails.
The method enables error analysis and hypothesis refinement in causal interpretability.
Recursive application recovers high-level hypotheses in toy logic tasks.
Abstract
We present a method for diagnosing interpretation in neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the…
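To make the idea of bucketing inputs by pairwise interchange-intervention behavior concrete, here is a minimal sketch on a toy logic task. It is not the authors' algorithm. The task `(a AND b) OR c`, the hand-written "low-level" model, the hypothesized variable `V = a AND b`, and the greedy removal heuristic in `split_inputs` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical low-level model: the hidden unit computes a AND b AND (NOT c),
# which still yields the correct output (a AND b) OR c on every input.
def low_hidden(x):
    a, b, c = x
    return a & b & (1 - c)

def low_output(h, x):
    return h | x[2]

# Hypothesized high-level model: V = a AND b, output = V OR c.
def high_var(x):
    return x[0] & x[1]

def high_output(v, x):
    return v | x[2]

def interchange_success(base, source):
    # Patch the source's intermediate value into the base run at both levels
    # and check that the two models still agree on the output.
    low = low_output(low_hidden(source), base)
    high = high_output(high_var(source), base)
    return low == high

inputs = list(product([0, 1], repeat=3))
fails = {(b, s) for b in inputs for s in inputs
         if not interchange_success(b, s)}

# Global interchange intervention accuracy over all (base, source) pairs.
global_iia = 1 - len(fails) / len(inputs) ** 2

def split_inputs(inputs, fails):
    # Greedy heuristic (an assumption of this sketch): repeatedly drop the
    # input involved in the most failing pairs until no failing pair remains
    # inside the kept set.
    good = list(inputs)
    while good:
        counts = {
            x: sum(1 for b, s in fails
                   if b in good and s in good and x in (b, s))
            for x in good
        }
        worst = max(good, key=counts.get)
        if counts[worst] == 0:
            break
        good.remove(worst)
    bad = [x for x in inputs if x not in good]
    return good, bad

good, bad = split_inputs(inputs, fails)
print(f"global IIA: {global_iia}")
print(f"well-interpreted inputs: {good}")
print(f"under-interpreted inputs: {bad}")
```

In this toy setting the global accuracy is below 1, yet every failing pair involves the single source input `(1, 1, 1)`. Setting that input aside leaves a region where the hypothesis holds on all pairs, and inspecting it (the case where `c = 1` masks the hidden unit) suggests how the hypothesis could be refined.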
