Sparse Autoencoders for Hypothesis Generation
Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson

TL;DR
HypotheSAEs uses sparse autoencoders and large language models to generate interpretable natural-language hypotheses relating text data to a target variable, while requiring less compute than recent LLM-based methods.
Contribution
The paper introduces HypotheSAEs, a new approach combining sparse autoencoders and LLMs for interpretable hypothesis generation from text data.
Findings
Better identifies reference hypotheses than baselines on synthetic datasets (at least +0.06 F1)
Produces roughly twice as many significant hypotheses as baselines on real datasets
Requires 1-2 orders of magnitude less compute than recent LLM-based methods
Abstract
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces…
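The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the embeddings are random synthetic vectors, the sparse autoencoder is a small ReLU model with an L1 penalty trained by plain gradient descent, feature selection uses absolute Pearson correlation, and the interpretation step only builds a prompt string rather than calling an LLM. All function names (`train_sae`, `select_features`, `build_prompt`) and hyperparameters are hypothetical choices made for this example.

```python
import numpy as np

def train_sae(X, n_features=8, l1=1e-3, lr=0.01, epochs=300, seed=0):
    """Step 1 (sketch): fit a tiny ReLU autoencoder with an L1 sparsity penalty."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0, 0.1, (d, n_features))
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(0, 0.1, (n_features, d))
    b_dec = np.zeros(d)
    for _ in range(epochs):
        pre = X @ W_enc + b_enc
        Z = np.maximum(pre, 0.0)           # sparse, non-negative activations
        err = Z @ W_dec + b_dec - X        # reconstruction error
        g_out = 2.0 * err / n              # gradient of mean squared error
        g_Z = g_out @ W_dec.T + l1 / n     # add the L1 penalty gradient
        g_pre = g_Z * (pre > 0)            # ReLU passes gradient only where active
        W_dec -= lr * (Z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return W_enc, b_enc

def encode(X, W_enc, b_enc):
    """Map embeddings to autoencoder feature activations."""
    return np.maximum(X @ W_enc + b_enc, 0.0)

def select_features(Z, y, k=3):
    """Step 2 (sketch): rank features by absolute Pearson correlation with y."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Zc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Features that never vary (e.g. dead units) get correlation 0.
    corr = np.divide(Zc.T @ yc, denom, out=np.zeros(Z.shape[1]), where=denom > 0)
    top = np.argsort(-np.abs(corr))[:k]
    return top, corr

def build_prompt(texts, activations, n_examples=3):
    """Step 3 (sketch): assemble a prompt from the top-activating texts."""
    order = np.argsort(-activations)[:n_examples]
    examples = "\n".join(f"- {texts[i]}" for i in order)
    return ("These texts most strongly activate one feature:\n"
            + examples
            + "\nDescribe in one short phrase what they have in common.")

# Synthetic stand-in data: 200 "headlines" with 16-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 200, 16
latent = (rng.random(n) > 0.5).astype(float)   # hidden property of each text
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * latent                        # the property shifts one direction
y = latent + 0.1 * rng.normal(size=n)          # target depends on the property
texts = [f"headline {i}" for i in range(n)]

W_enc, b_enc = train_sae(X)
Z = encode(X, W_enc, b_enc)
top, corr = select_features(Z, y, k=3)
prompt = build_prompt(texts, Z[:, top[0]])
print(top, prompt, sep="\n")
```

In a real pipeline the prompt would be sent to an LLM, and its answer (for example, "mentions being surprised or shocked") would serve as the hypothesis for that feature.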
Taxonomy
Topics: Computational Physics and Python Applications · Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
Methods: Sparse Autoencoder
