TL;DR
The paper introduces the Linear Centroids Hypothesis (LCH), which identifies the features of a deep network with linear directions in its centroid spaces rather than its activation spaces, and uses centroids as a drop-in replacement for activations in standard interpretability tools.
Contribution
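It presents the Linear Centroids Hypothesis as an interpretability framework that ties feature directions to the network's learned input-output maps: each centroid summarizes a local affine expert, and existing analysis methods can be applied to centroid spaces in place of activation spaces.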
Findings
Replacing activations with centroids yields sparser feature dictionaries.
LCH improves interpretability of circuits and saliency maps.
Code is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis.
Abstract
The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in the activation spaces, i.e., output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions among a DN's centroid spaces -- where any vector denotes a centroid or summary of a local affine expert characterizing the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful…
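The "local affine expert" in the abstract can be made concrete for a piecewise-affine network. The following is a minimal sketch, not the authors' implementation: it builds a small random ReLU MLP, extracts the affine map (A, c) that the network applies in the linear region containing an input x, and checks that it reproduces the network output exactly. The function names, the layer sizes, and in particular the summary `A.T @ 1` used as a stand-in "centroid" are assumptions made for illustration; the abstract does not specify how a centroid is computed.

```python
import numpy as np

def relu_mlp(x, weights, biases):
    """Forward pass of a ReLU MLP whose last layer is linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

def local_affine_map(x, weights, biases):
    """Return (A, c) with f(z) = A @ z + c on the linear region containing x."""
    A = np.eye(x.shape[0])
    c = np.zeros(x.shape[0])
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ (A @ x + c) + b          # pre-activation at x
        mask = (pre > 0).astype(float)     # ReLU activation pattern at x
        A = mask[:, None] * (W @ A)        # zero out rows of inactive units
        c = mask * (W @ c + b)
    A = weights[-1] @ A
    c = weights[-1] @ c + biases[-1]
    return A, c

def centroid_summary(A):
    # ASSUMPTION (hypothetical): summarize the local expert by A^T 1,
    # i.e. the input-gradient of the summed outputs. The source does not
    # confirm this definition.
    return A.T @ np.ones(A.shape[0])

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # hypothetical layer widths
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(sizes[0])
A, c = local_affine_map(x, weights, biases)
mu = centroid_summary(A)
```

Under these assumptions, `mu` lives in the input space of the sub-network it summarizes, so it can be collected across inputs and passed to the same tools that would otherwise consume activations.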
