TL;DR
This paper investigates how causal interventions in neural networks can push internal representations out of distribution, which can make the resulting interpretability explanations less faithful to the model in its natural state, and proposes a modified CL loss to mitigate harmful divergences.
Contribution
It provides a theoretical and empirical analysis of divergences caused by causal interventions and introduces a modified CL loss to keep representations closer to the natural distribution.
Findings
Common causal interventions often shift internal representations out of distribution.
Divergences fall into two cases: "harmless" ones, which stay in the behavioral null-space of the layer(s) of interest, and "pernicious" ones, which activate hidden network pathways.
The modified CL loss reduces harmful divergences while maintaining interpretive power.
Abstract
A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to…
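The idea of a "harmless" divergence in a behavioral null-space can be illustrated with a toy linear case. The sketch below is not the paper's method or setup: it assumes a hypothetical linear readout `W` standing in for everything downstream of the layer of interest, and the names `h`, `delta_null`, and `delta_row` are illustrative. It shows that a shift along the null space of `W` moves the representation without changing the output, while a shift with a row-space component does change it. The second shift is only a contrast case and does not model the "pernicious" hidden-pathway divergences the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical downstream linear readout (an assumption for illustration):
# maps a 4-dim representation to a 2-dim output.
W = rng.normal(size=(2, 4))

# Null-space basis of W from the SVD: the right singular vectors
# beyond the numerical rank span {v : W v = 0}.
_, s, vt = np.linalg.svd(W)
rank = int(np.sum(s > 1e-10))
null_basis = vt[rank:].T  # shape (4, 4 - rank)

# A stand-in for a "natural" representation.
h = rng.normal(size=4)

# Intervention 1: a large shift confined to the null space of W.
delta_null = 5.0 * (null_basis @ rng.normal(size=null_basis.shape[1]))
h_null = h + delta_null

# Intervention 2 (contrast): a shift lying in the row space of W.
delta_row = W.T @ rng.normal(size=2)
h_row = h + delta_row

# Compare how far each intervention moves the output.
out_change_null = np.linalg.norm(W @ h_null - W @ h)
out_change_row = np.linalg.norm(W @ h_row - W @ h)

print("representation shift (null space):", np.linalg.norm(delta_null))
print("output change (null space):       ", out_change_null)
print("output change (row space):        ", out_change_row)
```

In a real network the downstream computation is nonlinear, so the behavioral null-space need not be a linear subspace; the linear readout here is only the simplest setting in which the distinction is visible.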