Addressing divergent representations from causal interventions on neural networks

Satchel Grant; Simon Jerome Han; Alexa R. Tartaglini; Christopher Potts

arXiv:2511.04638 · cs.LG · April 24, 2026

TL;DR

This paper investigates whether causal interventions on neural networks push internal representations out of distribution, which would make the resulting explanations less faithful to the model in its natural state, and proposes a way to mitigate the harmful cases.

Contribution

The paper gives a theoretical and empirical analysis of the divergences that causal interventions produce, and introduces a modified CL loss that keeps intervened representations closer to the natural distribution.

Findings

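01. Common causal intervention techniques often shift internal representations away from the target model's natural distribution.

02. Divergences are either harmless (confined to the behavioral null-space of the layers of interest) or pernicious (activating hidden network pathways and causing dormant behavioral changes), and the distinction matters for how far an explanation can be trusted.

03. A modified CL loss reduces harmful divergences while preserving the interpretive power of interventions.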
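
As a toy illustration of the first finding, the sketch below patches one coordinate of a representation with the same coordinate from another sample and measures how far the result lands from the natural distribution. Everything in it is an assumption made for illustration and is not taken from the paper: the two-feature "representations", the coordinate-swap intervention, and the use of Mahalanobis distance as a divergence measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "natural" representations: two strongly correlated features.
z = rng.normal(size=1000)
natural = np.stack([z, z + 0.05 * rng.normal(size=1000)], axis=1)

mean = natural.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(natural, rowvar=False))

def mahalanobis(points):
    """Distance of each row from the mean of the natural distribution."""
    centered = points - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

# Coordinate patching (illustrative intervention): overwrite feature 0 of
# each base vector with feature 0 from a randomly chosen source vector.
source = natural[rng.permutation(len(natural))]
patched = natural.copy()
patched[:, 0] = source[:, 0]

natural_dist = mahalanobis(natural).mean()
patched_dist = mahalanobis(patched).mean()
print(f"mean distance, natural: {natural_dist:.2f}, patched: {patched_dist:.2f}")
```

Each patched coordinate is itself a natural value, yet the combination breaks the correlation between the two features, so the patched vectors sit far from where natural vectors lie. Real networks and intervention methods are far more complex; this only shows why in-distribution parts do not guarantee an in-distribution whole.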

Abstract

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to…
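
The idea of a divergence that lies in a behavioral null-space can be sketched with a single linear readout. This is a simplified assumption for illustration, not the paper's construction: the matrix `W`, the vector `h`, and the step size are hypothetical, and a real network's downstream computation is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical downstream readout: 3 outputs from a 5-dimensional representation.
W = rng.normal(size=(3, 5))
h = rng.normal(size=5)

# With the full SVD, rows of Vt beyond rank(W) span the null space of W.
_, s, Vt = np.linalg.svd(W)
rank = int((s > 1e-10).sum())
null_basis = Vt[rank:]

# A large move along a null direction: h changes, W @ h does not.
delta_null = 10.0 * null_basis[0]
# An equally large move along a direction that W does read from.
delta_row = 10.0 * Vt[0]

out = W @ h
null_change = np.linalg.norm(W @ (h + delta_null) - out)
row_change = np.linalg.norm(W @ (h + delta_row) - out)
print(f"output change, null direction: {null_change:.2e}")
print(f"output change, read direction: {row_change:.2e}")
```

Both perturbations move `h` equally far from its original value, but only the second one changes the readout. The second direction is included purely as a contrast; it is not meant to model the pernicious case the abstract describes, which involves hidden pathways that a single linear map cannot capture.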

Videos

Addressing divergent representations from causal interventions on neural networks · SlidesLive