Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features

John Winnicki; Abeynaya Gnanasekaran; Eric Darve

arXiv:2604.23829·cs.AI·April 29, 2026


TL;DR

This paper introduces a method for constructing domain-specific, interpretable knowledge graphs from sparse autoencoder (SAE) features, so that a model's internal representations and the relationships among them can be read directly.

Contribution

It presents a multi-stage filtering and graph construction process that turns a flat inventory of SAE features into coherent, readable knowledge graphs for a chosen domain.

Findings

01 Graphs recover chapter and subchapter structure in biology texts

02 Graphs reveal concepts that bridge related topics

03 Graphs turn sparse features into compact, interpretable views

Abstract

Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than…
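To make the first two steps of the abstract concrete, here is a minimal sketch of contrastive filtering followed by a co-occurrence graph. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the thresholds, the use of per-document firing rates as the contrastive signal, and the Jaccard score for edges are all hypothetical choices, and the activation matrices are toy data.

```python
import numpy as np

def contrastive_filter(domain_acts, generic_acts, act_threshold=0.0,
                       min_domain_rate=0.2, min_gap=0.2):
    """Return indices of features that fire often on domain text and
    noticeably more often than on generic text (assumed criterion)."""
    # Fraction of documents on which each feature is active.
    domain_rate = (domain_acts > act_threshold).mean(axis=0)
    generic_rate = (generic_acts > act_threshold).mean(axis=0)
    keep = (domain_rate >= min_domain_rate) & (domain_rate - generic_rate >= min_gap)
    return np.flatnonzero(keep)

def cooccurrence_edges(domain_acts, feature_ids, act_threshold=0.0, min_jaccard=0.5):
    """Link kept features whose sets of active documents overlap strongly.
    Jaccard similarity is an assumed stand-in for the paper's edge score."""
    fired = (domain_acts[:, feature_ids] > act_threshold).astype(np.int64)
    both = fired.T @ fired                      # documents where both features fire
    counts = np.diag(both)                      # documents where each feature fires
    either = counts[:, None] + counts[None, :] - both
    edges = []
    for i in range(len(feature_ids)):
        for j in range(i + 1, len(feature_ids)):
            if either[i, j] > 0:
                score = both[i, j] / either[i, j]
                if score >= min_jaccard:
                    edges.append((int(feature_ids[i]), int(feature_ids[j]), float(score)))
    return edges

# Toy activations: rows are documents, columns are SAE features.
domain_acts = np.array([
    [1.0, 0.9, 0.0, 0.5],
    [0.8, 0.7, 0.0, 0.4],
    [0.0, 0.0, 0.6, 0.3],
    [0.9, 0.0, 0.7, 0.6],
])
generic_acts = np.array([
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.1, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.3],
    [0.1, 0.0, 0.0, 0.2],
])

kept = contrastive_filter(domain_acts, generic_acts)
edges = cooccurrence_edges(domain_acts, kept)
print(kept.tolist(), edges)
```

In this toy data, feature 3 fires on every document in both corpora, so the contrastive step drops it as generic. Of the three surviving features, only features 0 and 1 share enough documents to be linked.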
