The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter; Julian Minder; Thomas Hofmann; Tiago Pimentel

arXiv:2507.08802·cs.LG·November 13, 2025

TL;DR

This paper critically examines causal abstraction in neural networks and shows that, without constraints on the alignment maps, the concept becomes trivial, which calls into question its usefulness for mechanistic interpretability.

Contribution

It proves that, under reasonable assumptions, unrestricted alignment maps can trivially relate any neural network to any algorithm, highlighting the need for assumptions about how information is encoded.

Findings

01. Unrestricted causal maps can perfectly align models with arbitrary algorithms.

02. Empirical evidence shows models incapable of solving tasks can still be perfectly aligned.

03. Lifting linearity constraints makes causal abstraction vacuous without assumptions on information encoding.

Abstract

The concept of causal abstraction was recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We…
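
The contrast between linear and arbitrarily powerful alignment maps can be illustrated with a toy sketch. The example below is not the paper's construction or experimental setup. Everything in it is a hypothetical assumption chosen for illustration: the 2-bit inputs, the `hidden` function (an identity "network" whose hidden state is injective in the input), and the high-level variable `target_variable` (the XOR of the two bits). It covers only the read-out half of an alignment, not interchange interventions. An affine map fitted by least squares cannot recover the XOR variable from the hidden state, while an unrestricted map (here a lookup table) recovers it exactly, as it could recover any other function of the input.

```python
import itertools
import numpy as np

# Hypothetical toy setup (an assumption, not taken from the paper):
# all 2-bit inputs, enumerated as float vectors.
inputs = [np.array(bits, dtype=float)
          for bits in itertools.product([0, 1], repeat=2)]


def hidden(x):
    """Toy 'network' hidden state: the raw input, so it is injective."""
    return x.copy()


def target_variable(x):
    """High-level variable of a hypothetical algorithm: XOR of the two bits."""
    return int(x[0]) ^ int(x[1])


H = np.stack([hidden(x) for x in inputs])                         # shape (4, 2)
v = np.array([target_variable(x) for x in inputs], dtype=float)   # shape (4,)

# Linear (affine) alignment map: least-squares fit from hidden state to v.
H_aug = np.hstack([H, np.ones((len(H), 1))])   # append a bias column
w, *_ = np.linalg.lstsq(H_aug, v, rcond=None)
linear_error = float(np.max(np.abs(H_aug @ w - v)))

# Unrestricted alignment map: an arbitrary lookup table on hidden states.
# Because the hidden state is injective, the table can store any function
# of the input, including the XOR variable.
table = {tuple(hidden(x)): target_variable(x) for x in inputs}
lookup_pred = np.array([table[tuple(hidden(x))] for x in inputs], dtype=float)
lookup_error = float(np.max(np.abs(lookup_pred - v)))

print(f"worst-case error, affine map: {linear_error:.2f}")
print(f"worst-case error, lookup map: {lookup_error:.2f}")
```

In this sketch the lookup table succeeds only because the hidden state distinguishes every input, so the perfect read-out says nothing about whether the toy network actually computes XOR. That is the intuition behind the concern that unrestricted maps make alignment uninformative.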
