Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

Jordan F. McCann

arXiv:2605.12874·cs.LG·May 14, 2026

TL;DR

This paper identifies a structural problem in sparse autoencoder (SAE) auto-interpretability, descriptive collision, in which many distinct features share the same explanation, and proposes corrective metrics for interpretability scoring.

Contribution

The author formalizes the concept of descriptive collision, analyzes its prevalence in existing datasets, and introduces corrective metrics to improve interpretability scoring.

Findings

01. 82.1% of features share their annotation with at least one other feature.

02. The most common annotation, 'plural nouns', labels 101 features across multiple layers.

03. Ignoring collision inflates interpretability scores by roughly one-third of feature-identification bits.
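
One way to see why collision inflates a bits-based score is sketched below. This is an illustrative reading, not the paper's metric, whose definition is not given on this page. It assumes a uniform prior over N features: a label treated as unique is credited with log2(N) identification bits, while a label shared by k features only narrows the candidates from N to k, which is worth log2(N/k) bits. The function name `identification_bits` and the toy annotations are hypothetical.

```python
import math
from collections import Counter


def identification_bits(annotations):
    """Return (naive_bits, collision_aware_bits), averaged over features.

    Assumption (not from the paper): features are equally likely a priori,
    and an annotation shared by k features identifies a feature only up to
    that group of k.
    """
    n = len(annotations)
    counts = Counter(annotations)
    # Naive score: every explanation is credited as if it singled out one feature.
    naive = math.log2(n)
    # Collision-aware score: a label shared by k features yields log2(n / k) bits.
    aware = sum(math.log2(n / counts[a]) for a in annotations) / n
    return naive, aware


# Toy data, invented for illustration (one annotation string per feature).
toy = ["plural nouns"] * 4 + ["years", "years", "legal terms", "python imports"]
naive, aware = identification_bits(toy)
inflation = naive - aware
print(f"naive={naive:.2f} bits, collision-aware={aware:.2f} bits, inflation={inflation:.2f} bits")
```

Under the uniform-prior assumption the collision-aware average equals the entropy of the annotation distribution, so it can never exceed the naive log2(N) figure.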

Abstract

Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that an annotation string is reused across 3.07 features on average; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural…
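
The three collision statistics named in the abstract can be computed from a flat list of annotation strings, one per feature. The sketch below is a minimal illustration under stated assumptions, not the paper's code: `collision_stats` is a hypothetical helper, strings are matched after lowercasing and trimming whitespace (the paper's matching rule is not given here), and "mean reuse" is taken to be features per distinct string. The toy list is invented and is not the Marks et al. data.

```python
from collections import Counter


def collision_stats(annotations):
    """Summarize how often annotation strings are shared between features.

    Assumption: two annotations collide when they are equal after
    lowercasing and stripping surrounding whitespace.
    """
    normalized = [a.strip().lower() for a in annotations]
    counts = Counter(normalized)
    n = len(normalized)
    # Mean number of features per distinct annotation string.
    mean_reuse = n / len(counts)
    # Fraction of features whose annotation is also used by another feature.
    shared_fraction = sum(1 for a in normalized if counts[a] > 1) / n
    # Most frequently reused annotation string and its feature count.
    top_label, top_count = counts.most_common(1)[0]
    return {
        "mean_reuse": mean_reuse,
        "shared_fraction": shared_fraction,
        "top_label": top_label,
        "top_count": top_count,
    }


# Toy data, invented for illustration.
toy_annotations = [
    "plural nouns", "Plural nouns", "plural nouns ", "plural nouns",
    "years", "years", "legal terms", "python imports",
]
stats = collision_stats(toy_annotations)
print(stats)
```

On real annotation data the choice of normalization matters: stricter matching lowers every statistic, while looser matching (for example, grouping paraphrases) raises them.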
