Exemplar Partitioning for Mechanistic Interpretability

Jessica Rumbelow

arXiv:2605.14347·cs.LG·May 19, 2026

TL;DR

Exemplar Partitioning (EP) is an unsupervised method for building interpretable feature dictionaries from language model activations. It supports causal analysis and out-of-distribution detection while using roughly $10^{3}\times$ fewer tokens than comparable sparse autoencoders (SAEs).

Contribution

This paper introduces EP, a novel unsupervised approach that constructs interpretable activation dictionaries via leader-clustering, facilitating model interpretability and comparison across checkpoints.

Findings

01. EP dictionaries are interpretable and support causal interventions.

02. EP regions match SAE features at over 50% F1 score.

03. EP achieves high AUROC in concept detection with much less compute.

Abstract

We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^{3}\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction,…
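
The construction described in the abstract can be illustrated with a minimal sketch of leader clustering over a stream of vectors. This is not the paper's implementation: the Euclidean metric, the function names (`build_dictionary`, `assign`), and the toy 2-D "activations" are all assumptions made for illustration, and the abstract does not state which distance the method actually uses. The sketch shows the two properties the abstract names: a new exemplar is created only when an incoming vector lies farther than the threshold from every existing exemplar, so dictionary size falls out of the data, and region membership is nearest-exemplar assignment, which gives a Voronoi partition.

```python
import math
from typing import Iterable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    # Assumed metric; the source does not specify the distance function.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_dictionary(stream: Iterable[Vector], threshold: float) -> List[List[float]]:
    """Leader clustering: one pass over the stream, keeping observed vectors as exemplars."""
    exemplars: List[List[float]] = []
    for act in stream:
        # A vector becomes a new exemplar only if no existing exemplar
        # lies within the distance threshold.
        if not exemplars or min(euclidean(act, e) for e in exemplars) > threshold:
            exemplars.append(list(act))
    return exemplars


def assign(act: Vector, exemplars: List[List[float]]) -> int:
    """Voronoi membership: index of the nearest exemplar."""
    return min(range(len(exemplars)), key=lambda i: euclidean(act, exemplars[i]))


# Toy stream standing in for model activations (hypothetical data).
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (0.0, 0.2)]
exemplars = build_dictionary(stream, threshold=1.0)
print(exemplars)                      # [[0.0, 0.0], [5.0, 5.0]]
print(assign((4.0, 4.0), exemplars))  # 1
```

In this sketch the number of exemplars depends only on the threshold and the geometry of the stream, which mirrors the abstract's statement that dictionary size is not prespecified. The linear scan over exemplars is the simplest choice; a real pipeline over high-dimensional activations would likely need a faster nearest-neighbour search.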
