The Deleuzian Representation Hypothesis
Clément Cornet, Romaric Besançon, Hervé Le Borgne

TL;DR
This paper introduces a novel unsupervised clustering method inspired by Deleuze's philosophy to extract interpretable concepts from neural networks, outperforming prior autoencoder-based methods and enabling causal influence on model behavior.
Contribution
It proposes a new clustering-based approach for concept extraction that enhances diversity and interpretability, grounded in a discriminant analysis framework and inspired by Deleuze's philosophy.
Findings
Outperforms prior SAE-based concept extraction methods.
Achieves concept quality close to supervised approaches.
Enables causal steering of model representations.
Abstract
We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze's modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model's inner representations, demonstrating their…
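The pipeline the abstract describes (sample pairs of activations, take their differences, cluster them with a skewness-based feature weighting, then use the resulting directions for steering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The weight formula `1 / (1 + |skewness|)`, the plain Lloyd-style k-means, the additive `steer` rule, and all function names and defaults are assumptions made only for illustration.

```python
import numpy as np


def sample_differences(acts, n_pairs, rng):
    """Differences between randomly paired activation vectors (rows of acts)."""
    i = rng.integers(0, len(acts), size=n_pairs)
    j = rng.integers(0, len(acts), size=n_pairs)
    keep = i != j  # drop pairs of a sample with itself (zero difference)
    return acts[i[keep]] - acts[j[keep]]


def inverse_skewness_weights(acts, eps=1e-8):
    """Per-dimension weights that shrink as |skewness| grows.

    ASSUMPTION: the exact weighting used in the paper is not given in this
    summary; 1 / (1 + |skew|) is a hypothetical stand-in.
    """
    centered = acts - acts.mean(axis=0)
    std = centered.std(axis=0) + eps
    skew = (centered ** 3).mean(axis=0) / std ** 3
    weights = 1.0 / (1.0 + np.abs(skew))
    return weights / weights.sum()


def kmeans(x, k, rng, n_iter=50):
    """Plain Lloyd-style k-means; returns the k cluster centers."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return centers


def extract_concepts(acts, k, n_pairs=2000, seed=0):
    """Cluster activation differences; return k unit-norm concept directions."""
    rng = np.random.default_rng(seed)
    diffs = sample_differences(acts, n_pairs, rng)
    weights = inverse_skewness_weights(acts)
    scale = np.sqrt(weights)
    # Scaling features by sqrt(w) makes squared Euclidean distance equal to
    # the w-weighted squared distance in the original space.
    centers = kmeans(diffs * scale, k, rng)
    concepts = centers / scale  # map centers back to activation space
    norms = np.linalg.norm(concepts, axis=1, keepdims=True)
    return concepts / np.maximum(norms, 1e-12)


def steer(activation, concept, alpha):
    """ASSUMPTION: steering modeled as adding a scaled concept direction."""
    return activation + alpha * concept


# Toy usage on synthetic "activations" (200 samples, 8 dimensions).
toy_acts = np.random.default_rng(1).normal(size=(200, 8))
toy_concepts = extract_concepts(toy_acts, k=4)
steered = steer(toy_acts[0], toy_concepts[0], alpha=2.0)
```

In this sketch the number of concepts is simply the `k` passed to k-means, and the pairs are drawn uniformly at random; both are choices of the illustration rather than statements about the paper's settings.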
Peer Reviews
Decision: ICLR 2026 Poster
I really like the perspective of engaging with the rich literature on concepts in philosophy and using that to motivate interpretability approaches, but I wish the paper went deeper on this narrative.
Reading through, I was first quite excited about the paper's idea and narrative (to look at activation differences), but, as it currently stands, I think the paper's operationalization of its core idea falls short of its promise. Main apprehensions are listed below. - Use of clustering to define "concepts": The paper currently takes differences of activations and simply runs a clustering protocol (KMeans) to extract "concepts" from them. While I'm not necessarily a big fan of SAEs, that approach a
The paper is well written, and I found at least four strong points. S1. Simplicity. The method has an appealingly simple pipeline and is easier to understand and reproduce than SAE variants with multiple hyperparameters. S2. Broad empirical evaluation. The paper provides extensive experiments across three modalities (vision, text, audio), five models, and multiple datasets, with systematic probe loss evaluation across 874 attributes. This breadth is commendable. S3. Competitive
Although I like the work, I notice several flaws, some major (labeled M) and some minor (labeled m). Below I detail these concerns: M1. Lack of operational definition for "concept." Section 2.1 lists desiderata but never provides a clear, falsifiable definition of what constitutes a concept beyond achieving low probe loss. The philosophical framing around Deleuze adds narrative color but doesn't translate into concrete, testable predictions that distinguish this approach from standard clustering
I find this work to provide a good alternative to SAEs towards interpretability. It constitutes a simple and nice approach grounded in discriminant analysis and clustering, while the inverse-skewness weighting is an interesting modification to improve concept diversity. The empirical evaluation considers multiple modalities and architectures, while the quantitative evaluation avoids the commonly considered sparsity-reconstruction tradeoff and uses the probe loss and MPPC towards concept explora
The use of KMeans on activation differences is conceptually interesting, but it's unclear how representative the randomly sampled pairs are. Could the sampling procedure bias the extracted concepts? Is the number of concepts fixed a priori? Is this the value that dictates the number of clusters for KMeans? How does the method fare when considering different values? The inverse skewness weighting needs further expansion. The inspiration is the Feature-Weighted KMeans, but do any of its prope
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Embodied and Extended Cognition
