TL;DR
This paper critically examines causal abstraction in neural networks. It shows that, without constraints on the alignment maps, the concept becomes trivial, which calls into question its usefulness for mechanistic interpretability.
Contribution
The paper proves that, under reasonable assumptions, unrestricted causal abstraction maps can relate any neural network to any algorithm, highlighting the need for assumptions about how information is encoded.
Findings
Unrestricted causal maps can perfectly align models with arbitrary algorithms.
Empirical evidence shows that models incapable of solving a task can still be perfectly aligned.
Lifting linearity constraints makes causal abstraction vacuous without assumptions on information encoding.
Abstract
The concept of causal abstraction has recently been popularised as a way to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We…
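The contrast between linear and arbitrarily powerful alignment maps can be illustrated with a toy sketch. This is not the paper's construction or experimental setup: it only checks whether a high-level variable can be read off hidden states, and does not perform interchange interventions. The weights `W`, the functions `hidden` and `xor`, and the lookup-table map are all hypothetical choices made for illustration. The "network" is a fixed linear layer that was never trained on XOR.

```python
from itertools import product

# Hypothetical toy "network": one fixed linear layer, never trained on XOR.
W = [[0.7, -1.3], [0.2, 0.9]]

def hidden(x):
    """Hidden state of the toy network for a 2-bit input x."""
    return tuple(sum(w * xi for w, xi in zip(row, x)) for row in W)

def xor(x):
    """High-level algorithm variable we try to locate in the network."""
    return x[0] ^ x[1]

inputs = list(product([0, 1], repeat=2))
states = {x: hidden(x) for x in inputs}

# Unrestricted alignment map: an arbitrary lookup table from hidden state
# to variable value. It succeeds whenever the hidden states are distinct.
lookup = {h: xor(x) for x, h in states.items()}
unrestricted_acc = sum(lookup[h] == xor(x) for x, h in states.items()) / len(inputs)

def linear_acc(a0, a1, b):
    """Accuracy of the thresholded linear map a . h + b > 0 at decoding XOR."""
    correct = 0
    for x, h in states.items():
        pred = int(a0 * h[0] + a1 * h[1] + b > 0)
        correct += pred == xor(x)
    return correct / len(inputs)

# Linear alignment map: brute-force search over a coarse grid of coefficients.
grid = [i / 4 for i in range(-12, 13)]
best_linear_acc = max(
    linear_acc(a0, a1, b) for a0 in grid for a1 in grid for b in grid
)

print("unrestricted map accuracy:", unrestricted_acc)
print("best linear map accuracy:", best_linear_acc)
```

In this sketch the lookup table decodes XOR perfectly even though the network does not compute it, while no thresholded linear map on the grid gets more than three of the four inputs right, because XOR is not linearly separable in a linear function of the inputs. The gap comes entirely from the expressive power of the map, not from anything the network has learned.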