Disentangling Dense Embeddings with Sparse Autoencoders
Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu

TL;DR
This paper applies sparse autoencoders to large language model embeddings of scientific abstracts, producing interpretable sparse features that retain semantic meaning and enable precise semantic search control.
Contribution
It introduces a novel application of SAEs to dense text embeddings, demonstrating their interpretability and utility in semantic search and concept disentanglement.
Findings
Sparse autoencoders produce interpretable features from dense embeddings.
Features maintain semantic fidelity across different model capacities.
The sparse features enable fine-grained control in semantic search applications.
Abstract
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying "feature families" that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for…
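To make the idea in the abstract concrete, here is a minimal sketch of how a sparse autoencoder could encode a dense embedding into a few active features, and how adjusting one feature could steer a search query. Everything in it is an assumption for illustration: the top-k sparsity rule, the dimensions, the randomly initialised weights standing in for a trained model, and the helper names `encode`, `decode`, `steer`, and `cosine_rank` are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not the paper's configuration).
d_model, n_features, k = 16, 64, 4

# Random weights stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x, k=k):
    """Map a dense embedding to sparse activations: ReLU, then keep the k largest."""
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    sparse = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]  # indices of the k largest pre-activations
    sparse[top] = pre[top]
    return sparse


def decode(z):
    """Reconstruct a dense embedding from sparse feature activations."""
    return z @ W_dec + b_dec


def steer(x, feature, delta):
    """Shift one feature's activation (clamped at zero) and decode the result."""
    z = encode(x)
    z[feature] = max(z[feature] + delta, 0.0)
    return decode(z)


def cosine_rank(query, corpus):
    """Return corpus row indices sorted by descending cosine similarity to query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(c @ q))


# Usage: steer a hypothetical query embedding, then re-rank a toy corpus.
query = rng.normal(size=d_model)
corpus = rng.normal(size=(10, d_model))
steered_query = steer(query, feature=3, delta=2.0)
ranking = cosine_rank(steered_query, corpus)
```

In this sketch the steered query differs from the plain reconstruction only by a multiple of one decoder row, which is one simple way to read "steering" as moving the query along a single interpretable direction.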
Taxonomy
Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
