Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features
John Winnicki, Abeynaya Gnanasekaran, Eric Darve

TL;DR
This paper introduces a method for constructing domain-specific, interpretable knowledge graphs from sparse autoencoder (SAE) features, making a model's internal representations and the relationships among them easier to understand.
Contribution
The paper presents a novel multi-stage filtering and graph-construction process that turns a flat inventory of SAE features into coherent, readable knowledge graphs of a domain.
Findings
The graphs recover chapter and subchapter structure in biology texts
They reveal concepts that bridge related topics
They turn sparse features into compact, interpretable views
Abstract
Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than…
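The first two steps described in the abstract, contrastive filtering followed by a co-occurrence graph, can be illustrated with a small sketch. This is not the authors' implementation: the mean-activation ratio score, the `min_ratio` and `min_count` thresholds, the function names, and the toy feature ids are all assumptions chosen for illustration, and the paper's actual multi-stage filter is not specified in this summary.

```python
from collections import Counter
from itertools import combinations


def contrastive_filter(domain_acts, generic_acts, min_ratio=2.0, eps=1e-6):
    """Keep feature ids that fire much more strongly on domain text.

    Hypothetical scoring rule: ratio of mean activation on the domain
    corpus to mean activation on a generic corpus. Assumes every feature
    has at least one recorded domain activation.
    """
    kept = []
    for fid, d_vals in domain_acts.items():
        d_mean = sum(d_vals) / len(d_vals)
        g_vals = generic_acts.get(fid, [0.0])
        g_mean = sum(g_vals) / len(g_vals)
        if d_mean / (g_mean + eps) >= min_ratio:
            kept.append(fid)
    return sorted(kept)


def cooccurrence_graph(active_sets, kept, min_count=2):
    """Count how often pairs of kept features are active in the same passage.

    Returns a dict mapping (feature_a, feature_b) to a co-occurrence count,
    keeping only pairs seen at least min_count times (an assumed threshold).
    """
    kept = set(kept)
    counts = Counter()
    for active in active_sets:
        present = sorted(kept & set(active))
        counts.update(combinations(present, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy data: feature ids and activation values are made up.
domain_acts = {"f_cell": [0.9, 0.8], "f_dna": [0.7, 0.9], "f_the": [0.5, 0.5]}
generic_acts = {"f_cell": [0.1, 0.0], "f_dna": [0.05, 0.05], "f_the": [0.5, 0.6]}
passages = [{"f_cell", "f_dna", "f_the"}, {"f_cell", "f_dna"}, {"f_cell"}]

kept = contrastive_filter(domain_acts, generic_acts)
edges = cooccurrence_graph(passages, kept)
print(kept)   # ['f_cell', 'f_dna']
print(edges)  # {('f_cell', 'f_dna'): 2}
```

In this toy run the generic feature `f_the` is dropped because it is about as active on generic text as on domain text, and the only surviving edge links the two domain features that fire together in two passages. The transcoder-based mechanism graph and the automated edge labeling mentioned in the abstract are not covered by this sketch.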
