Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Aleksandar Makelov; Georg Lange; Neel Nanda

arXiv:2311.17030 · cs.LG · December 7, 2023 · 1 citation

TL;DR

Subspace activation patching can create an illusion of interpretability: a patch may make the model's output behave as if a feature's value had changed, yet achieve this by activating a dormant parallel pathway through a different subspace, one that is causally disconnected from the model's outputs.

Contribution

The study shows that a subspace intervention can change model behavior without correctly attributing the underlying feature to the patched subspace, and it explains the mechanism with a distilled mathematical example and experiments in real-world domains.

Findings

01

Subspace patching can activate dormant pathways, producing misleading interpretability signals.

02

The phenomenon appears in real-world tasks, including indirect object identification and factual recall.

03

Rank-1 fact editing is linked to the interpretability illusion.

Abstract

Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature was changed, this effect may be achieved by activating a dormant parallel pathway leveraging another subspace that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect…
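The dormant-pathway mechanism described in the abstract can be illustrated with a small toy model. The sketch below is only an assumption-laden illustration, not the authors' code: the weights `W_IN` and `W_OUT`, the helper names, and the input values are all hypothetical choices. It builds a three-unit linear network in which coordinate 0 is written by the input but never read by the output (causally disconnected), coordinate 1 is read by the output but never written (dormant), and coordinate 2 carries the real computation. Patching along the sum of the disconnected and dormant directions then changes the output exactly as patching the true direction would.

```python
import math

# Hypothetical toy weights, assumed for illustration only.
W_IN = (1.0, 0.0, 1.0)   # input -> hidden: coordinate 1 is never written (dormant)
W_OUT = (0.0, 2.0, 1.0)  # hidden -> output: coordinate 0 is never read (disconnected)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def hidden(x):
    """Hidden activation for a scalar input x."""
    return [x * w for w in W_IN]


def readout(h):
    """Scalar model output for a hidden activation h."""
    return dot(W_OUT, h)


def patch_direction(h_base, h_src, v):
    """Replace the component of h_base along v with that of h_src."""
    norm = math.sqrt(dot(v, v))
    u = [c / norm for c in v]
    delta = dot(u, h_src) - dot(u, h_base)
    return [b + delta * c for b, c in zip(h_base, u)]


def patched_output(x_base, x_src, v):
    """Run the base input, patch direction v from the source run, read out."""
    return readout(patch_direction(hidden(x_base), hidden(x_src), v))


x_base, x_src = 1.0, 3.0

# True feature direction: the only coordinate both written and read.
out_true = patched_output(x_base, x_src, (0.0, 0.0, 1.0))
# Illusory direction: disconnected coordinate plus dormant coordinate.
out_illusory = patched_output(x_base, x_src, (1.0, 1.0, 0.0))
# Each ingredient on its own has no effect on the output.
out_disconnected = patched_output(x_base, x_src, (1.0, 0.0, 0.0))
out_dormant = patched_output(x_base, x_src, (0.0, 1.0, 0.0))

print(out_true, out_illusory, out_disconnected, out_dormant)
```

In this toy setting, the true and illusory patches both move the output from the base value to the source value (up to floating-point rounding), while patching the disconnected or the dormant coordinate alone leaves the output unchanged. A successful patch therefore does not by itself show that the patched subspace is where the model represents the feature.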

Code & Models

Repositories

amakelov/activation-patching-illusion
jax · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Topic Modeling