Exemplar Partitioning for Mechanistic Interpretability
Jessica Rumbelow

TL;DR
Exemplar Partitioning (EP) is an unsupervised method for creating interpretable feature dictionaries from language model activations, enabling causal analysis and out-of-distribution detection with significantly reduced compute.
Contribution
This paper introduces EP, a novel unsupervised approach that constructs interpretable activation dictionaries via leader-clustering, facilitating model interpretability and comparison across checkpoints.
Findings
EP dictionaries are interpretable and support causal interventions.
EP regions match SAE features with an F1 score above 50%.
EP achieves high AUROC in concept detection with much less compute.
Abstract
We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction,…
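The construction described in the abstract can be illustrated with a minimal leader-clustering sketch. This is not the paper's implementation. It assumes Euclidean distance (the abstract does not name the metric), and the names `build_dictionary`, `assign`, and `threshold` are hypothetical. A streamed activation becomes a new exemplar only when no existing exemplar lies within the threshold, and membership is then decided by the nearest exemplar, which yields a Voronoi partition.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def build_dictionary(stream: Iterable[Sequence[float]], threshold: float) -> List[Vector]:
    """Leader-cluster a stream of activation vectors.

    Assumption: Euclidean distance; the metric is not given in the abstract.
    The number of exemplars is not fixed in advance; it depends on the data
    and on the threshold.
    """
    exemplars: List[Vector] = []
    for x in stream:
        # Keep x as a new exemplar only if no existing exemplar is within the threshold.
        if all(math.dist(x, e) > threshold for e in exemplars):
            exemplars.append(tuple(x))
    return exemplars


def assign(x: Sequence[float], exemplars: List[Vector]) -> int:
    """Return the index of the nearest exemplar, i.e. the Voronoi region containing x."""
    return min(range(len(exemplars)), key=lambda i: math.dist(x, exemplars[i]))


# Toy two-dimensional "activations" standing in for real model activations.
toy_stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.1), (0.0, 0.2), (10.0, 0.0)]
toy_exemplars = build_dictionary(toy_stream, threshold=1.0)
toy_region = assign((4.8, 5.3), toy_exemplars)
```

Every exemplar here is a vector that was actually observed in the stream, which is the property the abstract relies on when it says dictionaries built from the same data stream are directly comparable.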
