From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Ruben Fernandez-Boullon, Pablo Magariños-Docampo, Javier Perez-Robles

TL;DR
This paper introduces a graph-based method that applies Weisfeiler-Lehman analysis to sparse autoencoder features, capturing higher-order token co-occurrence structure that top-activating token lists leave unexamined.
Contribution
It presents a novel graph-structured representation and a custom graph kernel to analyze and cluster autoencoder features beyond traditional token list methods.
Findings
Clustering recovers heuristic motif families not found by cosine similarity.
Graph view surfaces structural relationships missed by token-frequency and decoder-weight views.
Cluster assignments are stable across hyperparameters and random seeds.
Abstract
Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families…
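The construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `top_k` cutoff, the log2 frequency binning, and the cosine-normalised histogram kernel are all assumptions chosen for illustration, and the input is assumed to be a list of already-extracted context windows (token lists) around strong activations of one feature.

```python
import math
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(contexts, top_k=5):
    """Build a token co-occurrence graph for one feature (hypothetical helper).

    contexts: list of token lists, each a local window around a strong activation.
    Returns (nodes, edges): nodes maps token -> count, edges is a set of token pairs.
    """
    freq = Counter(tok for ctx in contexts for tok in ctx)
    nodes = dict(freq.most_common(top_k))  # keep the top_k most frequent tokens
    edges = set()
    for ctx in contexts:
        present = sorted(set(ctx) & nodes.keys())
        edges.update(combinations(present, 2))  # connect tokens sharing a window
    return nodes, edges


def frequency_bin(count, n_bins=3):
    """Assumed binning scheme: floor(log2(count)), capped at n_bins - 1."""
    return min(int(math.log2(count)), n_bins - 1)


def wl_histogram(nodes, edges, iterations=2):
    """Count WL labels over all refinement rounds, starting from frequency bins."""
    adjacency = {n: [] for n in nodes}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # Initial labels ignore token identity; only the frequency bin is used.
    labels = {n: frequency_bin(c) for n, c in nodes.items()}
    histogram = Counter(labels.values())
    for _ in range(iterations):
        # New label = (own label, sorted multiset of neighbour labels).
        labels = {
            n: (labels[n], tuple(sorted(labels[m] for m in adjacency[n])))
            for n in nodes
        }
        histogram.update(labels.values())
    return histogram


def wl_kernel(hist_a, hist_b):
    """Cosine-normalised dot product of two WL label histograms."""
    dot = sum(hist_a[k] * hist_b[k] for k in hist_a.keys() & hist_b.keys())
    norm = math.sqrt(
        sum(v * v for v in hist_a.values()) * sum(v * v for v in hist_b.values())
    )
    return dot / norm if norm else 0.0
```

Because the initial labels are frequency bins rather than token identities, two features whose graphs have the same shape and frequency profile score as similar under this sketch even when they share no tokens, which is the kind of structural relationship a token-list comparison would not surface.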
