The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

Thomas Walker; Ahmed Imtiaz Humayun; Randall Balestriero; Richard Baraniuk

arXiv:2604.11962·cs.LG·May 11, 2026

TL;DR

The paper introduces the Linear Centroids Hypothesis (LCH), which identifies a deep network's features with linear directions in its centroid spaces rather than its activation spaces, so that centroids can replace activations in standard interpretability tools.

Contribution

It presents the Linear Centroids Hypothesis as an interpretability framework that unifies existing analysis methods by applying them to the centroid spaces of a deep network, where each centroid summarizes one local affine expert.

Findings

01

Replacing activations with centroids yields sparser feature dictionaries.

02

LCH improves interpretability of circuits and saliency maps.

03

Code is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis.

Abstract

The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in the activation spaces, i.e., output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions among a DN's centroid spaces -- where any vector denotes a centroid or summary of a local affine expert characterizing the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful…
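To make "local affine expert" concrete, the sketch below extracts the affine map (A, c) that a small ReLU network applies on the linear region containing an input x, and checks that it reproduces the network output there. It is an illustration under stated assumptions, not the authors' implementation: the toy network, the helper names (`relu_mlp`, `local_affine_expert`, `centroid`), and in particular the choice of the summary vector as the column sums of A (that is, Aᵀ1) are all hypothetical. The abstract does not state how the paper defines a centroid.

```python
import random

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_mlp(x, layers):
    """Forward pass with ReLU after every layer except the last."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = [z + bi for z, bi in zip(matvec(W, h), b)]
        if i < len(layers) - 1:
            h = [max(z, 0.0) for z in h]
    return h

def local_affine_expert(x, layers):
    """Return (A, c) such that f(z) = A z + c on the linear region containing x."""
    d = len(x)
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    c = [0.0] * d
    h = x
    for i, (W, b) in enumerate(layers):
        # Compose the running affine map with this layer's affine part.
        A = matmul(W, A)
        c = [z + bi for z, bi in zip(matvec(W, c), b)]
        h = [z + bi for z, bi in zip(matvec(W, h), b)]
        if i < len(layers) - 1:
            # The ReLU pattern at x is fixed within the region, so it acts
            # as a 0/1 diagonal mask on the map.
            mask = [1.0 if z > 0 else 0.0 for z in h]
            A = [[m * a for a in row] for m, row in zip(mask, A)]
            c = [m * ci for m, ci in zip(mask, c)]
            h = [m * z for m, z in zip(mask, h)]
    return A, c

def centroid(A):
    """ASSUMPTION: summarize the expert by the column sums of A (A^T 1)."""
    return [sum(col) for col in zip(*A)]

def random_layer(n_out, n_in, rng):
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b

rng = random.Random(0)
# Hypothetical toy network: 3 inputs -> 4 -> 4 -> 2 outputs.
layers = [random_layer(4, 3, rng), random_layer(4, 4, rng), random_layer(2, 4, rng)]
x = [0.5, -0.2, 0.8]

A, c = local_affine_expert(x, layers)
mu = centroid(A)  # one vector in input space per region, under the assumption above
```

Under this assumption, inputs that fall in the same linear region share the same (A, c) and therefore the same summary vector, which is what allows such a vector to stand in for an activation when a tool expects one vector per input.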

Code & Models

Repositories

ThomasWalker1/LinearCentroidsHypothesis (GitHub)
