Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda

TL;DR
This paper introduces sparse autoencoder embeddings as an interpretable, cost-effective, and controllable alternative to dense embeddings and LLMs for analyzing large-scale text data, uncovering insights and biases.
Contribution
The paper presents a novel application of sparse autoencoders for creating interpretable embeddings that are cheaper and more reliable than LLM-based methods and more controllable than dense embeddings for data analysis tasks.
Findings
SAE embeddings reveal semantic differences between datasets
SAE embeddings uncover unexpected concept correlations
SAE embeddings outperform dense embeddings in property-based retrieval
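Each of the three findings can be framed as a simple operation on a document-by-latent matrix of pooled SAE activations. The sketch below illustrates that framing under stated assumptions and is not the authors' pipeline. `emb_a`, `emb_b`, and all function names are hypothetical. A latent counts as "firing" when its activation exceeds a threshold. Dataset differences are approximated by the gap in firing rates. Concept correlations are scored with normalized pointwise mutual information (NPMI), a scoring choice substituted here. Property-based retrieval sums only the latents hand-picked as relevant to one property.

```python
import numpy as np


def diff_latents(emb_a, emb_b, top_n=5, thresh=0.0):
    """Rank latents by the gap in firing rate between dataset A and dataset B."""
    rate_a = (emb_a > thresh).mean(axis=0)  # fraction of A's documents where each latent fires
    rate_b = (emb_b > thresh).mean(axis=0)
    delta = rate_a - rate_b                 # > 0: more common in A; < 0: more common in B
    order = np.argsort(-np.abs(delta), kind="stable")[:top_n]
    return [(int(i), float(delta[i])) for i in order]


def cooccurring_latents(emb, top_n=5, thresh=0.0, min_count=1):
    """Rank latent pairs by NPMI of co-firing across documents (NPMI is an assumed score)."""
    fires = (emb > thresh).astype(float)    # (n_docs, n_latents) binary indicators
    n_docs, n_latents = fires.shape
    p = fires.mean(axis=0)                  # P(latent i fires)
    counts = fires.T @ fires                # documents where latents i and j both fire
    scored = []
    for i in range(n_latents):
        for j in range(i + 1, n_latents):
            if counts[i, j] < min_count:
                continue
            p_ij = counts[i, j] / n_docs
            if p_ij >= 1.0:                 # pair always co-fires: NPMI denominator is zero
                continue
            npmi = np.log(p_ij / (p[i] * p[j])) / -np.log(p_ij)
            scored.append((i, j, float(npmi)))
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:top_n]


def property_retrieval(emb, latent_ids, top_n=3):
    """Pre-filter to latents tied to one property, then rank documents by summed activation."""
    scores = emb[:, latent_ids].sum(axis=1)
    return [int(d) for d in np.argsort(-scores, kind="stable")[:top_n]]


# Toy pooled SAE embeddings: rows are documents, columns are 4 latents (hypothetical data).
emb_a = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0]], dtype=float)
emb_b = np.array([[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]], dtype=float)
both = np.vstack([emb_a, emb_b])

print(diff_latents(emb_a, emb_b, top_n=2))         # [(0, 1.0), (3, -1.0)]
print(cooccurring_latents(both, top_n=1))          # pair (0, 2) scores highest
print(property_retrieval(both, [1, 3], top_n=2))   # [4, 2]
```

In a real analysis, the latent ids passed to `property_retrieval` would presumably come from searching the latents' natural-language descriptions for the property of interest; here they are fixed by hand.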
Abstract
Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies…
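The abstract defines SAE embeddings only as representations whose dimensions map to interpretable concepts. One plausible way to build such a vector is sketched below. It assumes a TopK-style SAE encoder (ReLU, then keep the k largest latents per token) applied to a language model's token activations, followed by max-pooling over tokens to get one vector per document. Random weights stand in for a trained SAE, and `sae_encode`, `sae_embedding`, and the sizes used are hypothetical.

```python
import numpy as np


def sae_encode(acts: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray, k: int) -> np.ndarray:
    """Map token activations (n_tokens, d_model) to sparse latents (n_tokens, n_latents).

    Assumption: a TopK-style SAE encoder; ReLU pre-activations, keep the k largest per token.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    latents = np.maximum(acts @ W_enc + b_enc, 0.0)  # ReLU(x @ W_enc + b_enc)
    if k < latents.shape[1]:
        # Indices of every latent except the k largest in each row (token).
        drop = np.argpartition(latents, -k, axis=1)[:, :-k]
        np.put_along_axis(latents, drop, 0.0, axis=1)
    return latents


def sae_embedding(acts: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray, k: int) -> np.ndarray:
    """Max-pool per-token latents into one document vector of shape (n_latents,)."""
    return sae_encode(acts, W_enc, b_enc, k).max(axis=0)


# Toy usage: random weights stand in for a trained SAE (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_latents, n_tokens, k = 16, 64, 10, 4
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
acts = rng.normal(size=(n_tokens, d_model))

token_latents = sae_encode(acts, W_enc, b_enc, k)
doc_embedding = sae_embedding(acts, W_enc, b_enc, k)
print(doc_embedding.shape)                                  # (64,)
print(np.count_nonzero(token_latents, axis=1).max() <= k)   # True
```

Because each token keeps at most k active latents, the pooled document vector stays sparse, and each nonzero dimension can be read through that latent's description.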
Peer Reviews
Decision: Submitted to ICLR 2026
- The idea of using SAEs to generate interpretable text embeddings feels novel and well-motivated.
- The authors cover a wide breadth of applications: data diffing, correlation discovery, clustering, and retrieval.
- The experiments have great coverage, including both toy settings with ground-truth targets and real-world exploratory analyses. The authors make a solid effort to incorporate baselines (dense embeddings and LLM-based methods) for comparison.
- The real-world case studies find some
- The paper's breadth makes it challenging to communicate each experiment with sufficient depth. The main text requires constant cross-referencing with the appendix, and key details are often unclear or left for the reader to infer: for example, the latent relabeling procedure, the synthetic dataset construction in Section 4.2, what constitutes a "hypothesis," and how hypotheses are verified.
- Many results follow a pattern of generating hypotheses, verifying some subset, and presenting the verified
The paper introduces a novel and creative application of SAEs, extending them beyond their typical role in LLM interpretability to textual data analysis. I think SAEs are a great choice as a data analysis toolkit for the following reasons: the interpretable, sparse embeddings offer greater controllability than dense embeddings, for example by enabling pre-filtering of features for targeted analysis of specific properties. Further, SAEs can capture implicit features of chat dialogues beyond coarse
Overall, the experiments lack rigor, and the work feels preliminary (details below). I see this paper as a good proof of concept, but in its current state it is better suited to a workshop or a blog post. I have listed several weaknesses along with suggestions below (loosely in order of priority). Many of them relate to the four data analysis tasks. Personally, I think these tasks could be removed altogether; the paper would be stronger if it focused on the case studies instead. Y
* The paper is well written and easy to follow; the figures are creative and helpful.
* Even though SAEs are well known in mech interp, adapting them as embedding models is both interesting and novel.
* The experimental setups are clearly explained and diverse, and the claims are consistent with the findings.
Major
* Lack of ablations on the SAEs (size, corpora, etc.) and on the reader LLM, as well as limited dataset diversity.
Minor
* Many of the results are in the appendix, so reading requires a lot of back and forth.
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Advanced Graph Neural Networks
