Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Nick Jiang; Xiaoqing Sun; Lisa Dunlap; Lewis Smith; Neel Nanda

arXiv:2512.10092·cs.AI·December 12, 2025

Open Access 3 Reviews

TL;DR

This paper introduces sparse autoencoder embeddings as an interpretable, cost-effective, and controllable alternative to dense embeddings and LLMs for analyzing large-scale text data, uncovering insights and biases.

Contribution

The paper presents a novel application of sparse autoencoders for creating interpretable embeddings that outperform dense models and LLMs in cost, reliability, and controllability for data analysis tasks.

Findings

01. SAE embeddings reveal semantic differences between datasets

02. SAE embeddings uncover unexpected concept correlations

03. SAE embeddings outperform dense embeddings in property-based retrieval

Abstract

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies…
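To make the idea of an SAE embedding concrete, here is a minimal NumPy sketch of two steps the abstract describes: turning per-token activations into a document-level vector whose dimensions are SAE latents, and comparing two datasets by how often each latent fires. It is an illustration under stated assumptions, not the authors' implementation: the ReLU encoder, the max-pooling over tokens, the frequency-gap score, and all names (`sae_encode`, `sae_embedding`, `diff_datasets`, `W_enc`, `b_enc`) are hypothetical choices made for this example.

```python
import numpy as np


def sae_encode(token_acts, W_enc, b_enc):
    """Apply a ReLU SAE encoder to per-token activations.

    token_acts: (n_tokens, d_model); W_enc: (d_model, n_latents); b_enc: (n_latents,).
    The ReLU encoder form is an assumption; SAE variants differ.
    """
    return np.maximum(token_acts @ W_enc + b_enc, 0.0)


def sae_embedding(token_acts, W_enc, b_enc):
    """Pool token-level latents into one vector per document.

    Max-pooling over tokens is an assumption made for this sketch.
    """
    return sae_encode(token_acts, W_enc, b_enc).max(axis=0)


def diff_datasets(emb_a, emb_b, threshold=0.0, top_k=3):
    """Rank latents by the gap in firing frequency between two datasets.

    emb_a, emb_b: (n_docs, n_latents) arrays of document embeddings.
    Returns (latent_index, freq_a - freq_b) pairs, largest absolute gap first.
    """
    freq_a = (emb_a > threshold).mean(axis=0)
    freq_b = (emb_b > threshold).mean(axis=0)
    gap = freq_a - freq_b
    order = np.argsort(-np.abs(gap))[:top_k]
    return [(int(i), float(gap[i])) for i in order]


# Toy usage with an identity "encoder" so the latents are easy to read.
W_enc, b_enc = np.eye(3), np.zeros(3)
doc = np.array([[1.0, -1.0, 0.0], [0.5, 2.0, -3.0]])  # 2 tokens, d_model = 3
doc_embedding = sae_embedding(doc, W_enc, b_enc)

dataset_a = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
dataset_b = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
top_latents = diff_datasets(dataset_a, dataset_b, top_k=2)
```

In a real pipeline each high-gap latent would then be paired with a natural-language label, which is what makes the difference readable as a hypothesis about the data. Because every dimension is a named latent, one can also restrict the analysis to a chosen subset of latents before clustering or retrieval, which is the kind of control the abstract contrasts with dense embeddings.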

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The idea of using SAEs to generate interpretable text embeddings feels novel and well-motivated.
- The authors cover a wide breadth of applications: data diffing, correlation discovery, clustering, and retrieval.
- The experiments have great coverage, including both toy settings with ground truth targets and real-world exploratory analyses. The authors make a solid effort to incorporate baselines (dense embeddings and LLM-based methods) for comparison.
- The real-world case studies find some

Weaknesses

- The paper's breadth makes it challenging to communicate each experiment with sufficient depth. The main text requires constant cross-referencing with the appendix, and key details are often unclear or left for the reader to infer, for example the latent relabeling procedure, synthetic dataset construction in Section 4.2, what constitutes a "hypothesis," and how hypotheses are verified.
- Many results follow a pattern of generating hypotheses, verifying some subset, and presenting the verified

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper introduces a novel and creative application of SAEs beyond their typical role in LLM interpretability to the domain of textual data analysis. I think SAEs are a great choice as a data analysis toolkit for the following reasons: the interpretable and sparse embeddings offer greater controllability compared to dense embeddings, like enabling pre-filtering of features for targeted analysis of specific properties. Further, SAEs can capture implicit features of chat dialogues beyond coarse

Weaknesses

Overall, the experiments lack rigor, and the work feels preliminary (details below). I see this paper as a good proof-of-concept, and in its current state, it is more suitable for a workshop or a blog post. I have listed some weaknesses along with some suggestions below (loosely in order of priority). Many of them are related to the four data analysis tasks. Personally, I think these tasks could be removed altogether. The paper would be stronger if it focused more on the case studies instead. Y

Reviewer 03 · Rating 8 · Confidence 4

Strengths

* The paper is well written and easy to follow; the figures are creative and helpful.
* Even though SAEs are well known in mechanistic interpretability, adapting them as embedding models is both interesting and novel.
* The experimental setups are clearly explained and diverse, and the claims are consistent with the findings.

Weaknesses

Major
* Lack of ablations on the SAEs (size, corpora, etc.) and on the reader LLM, and limited diversity of datasets.

Minor
* Many of the results are in the appendix, so reading requires a lot of back and forth.

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Advanced Graph Neural Networks