Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Jordan F. McCann

TL;DR
This paper identifies a structural problem in sparse autoencoder interpretability called descriptive collision, in which many distinct features share the same explanation, and proposes corrective metrics for interpretability scoring.
Contribution
The authors formalize the concept of descriptive collision, analyze its prevalence in existing datasets, and introduce corrective metrics to improve interpretability scoring.
Findings
82.1% of features share their annotation with at least one other feature.
The most common annotation, "plural nouns", labels 101 features across multiple layers.
Ignoring collision inflates interpretability scores by roughly one-third of feature-identification bits.
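The collision statistics above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `collision_stats`, the toy annotation list, and in particular the bit accounting (log2 of the number of features sharing a label, averaged over features) are assumptions made for illustration, and the paper's own corrective metrics may be defined differently.

```python
from collections import Counter
from math import log2


def collision_stats(annotations):
    """Summarize how often annotation strings are reused across features.

    `annotations` holds one explanation string per feature.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    counts = Counter(annotations)
    n_features = len(annotations)

    # Features whose annotation is also carried by at least one other feature.
    n_shared = sum(c for c in counts.values() if c > 1)
    top_label, top_count = counts.most_common(1)[0]

    # Assumed bit accounting: singling out one of n features takes log2(n)
    # bits; a label shared by c features still leaves log2(c) bits unresolved.
    total_bits = log2(n_features)
    residual_bits = sum(c * log2(c) for c in counts.values()) / n_features

    return {
        "mean_features_per_label": n_features / len(counts),
        "shared_fraction": n_shared / n_features,
        "top_label": top_label,
        "top_count": top_count,
        "total_bits": total_bits,
        "residual_bits": residual_bits,
    }


# Toy data, invented for this example (not taken from the dataset).
toy = [
    "plural nouns", "plural nouns", "plural nouns",
    "commas", "commas",
    "closing brackets",
]
stats = collision_stats(toy)
print(stats)
```

On real data, `annotations` would be the list of human-written explanation strings, one per annotated feature; whether strings should be normalized (case, whitespace, near-synonyms) before counting is a design choice this sketch leaves open.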
Abstract
Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that each distinct annotation string is reused across 3.07 features on average; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural…
