Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

David Chanin; Tomáš Dulka; Adrià Garriga-Alonso

arXiv:2505.11756·cs.LG·September 29, 2025

TL;DR

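This paper identifies feature hedging: when a sparse autoencoder (SAE) is narrower than the number of underlying features and those features are correlated, it merges components of correlated features into the same latent, which reduces interpretability. The authors argue that both conditions hold for SAEs trained on large language models, and they propose an improved matryoshka SAE variant to mitigate the problem.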

Contribution

It introduces the concept of feature hedging caused by SAE reconstruction loss, analyzes it theoretically and empirically, and proposes a new SAE variant to address the problem.

Findings

01 Feature hedging causes correlated features to merge in narrow SAEs.

02 Narrower SAEs are more susceptible to feature hedging.

03 The proposed matryoshka SAE variant reduces feature hedging effects.

Abstract

It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently…
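The mechanism described in the abstract can be illustrated with a small closed-form calculation. The sketch below is not the paper's experiment: it assumes two orthogonal binary features, a single latent whose encoder is fixed to fire exactly when the first feature is active, a free decoder bias, and no sparsity penalty. Under those assumptions, the decoder direction that minimizes reconstruction MSE is the least-squares slope Cov(activation, input) / Var(activation). The function name `optimal_decoder` and its parameters are hypothetical and chosen only for this illustration.

```python
from itertools import product


def optimal_decoder(p_f1, p_f2_given_f1, p_f2_given_not_f1):
    """MSE-optimal decoder for one latent that fires exactly when f1 is active.

    Assumptions (illustrative only): f1 and f2 are orthogonal unit vectors,
    each feature is either off (0) or on (1), and the decoder has a free bias.
    Requires 0 < p_f1 < 1 so the latent activation has nonzero variance.
    """
    f1, f2 = (1.0, 0.0), (0.0, 1.0)

    # Enumerate the four joint states exactly instead of sampling.
    rows = []  # (probability, latent activation, input vector)
    for a, b in product((0, 1), repeat=2):
        p_a = p_f1 if a else 1.0 - p_f1
        p_b_on = p_f2_given_f1 if a else p_f2_given_not_f1
        p_b = p_b_on if b else 1.0 - p_b_on
        x = tuple(a * u + b * v for u, v in zip(f1, f2))
        rows.append((p_a * p_b, float(a), x))

    mean_act = sum(p * act for p, act, _ in rows)
    var_act = sum(p * (act - mean_act) ** 2 for p, act, _ in rows)

    # Least-squares slope per input dimension: Cov(act, x_i) / Var(act).
    decoder = []
    for i in range(2):
        mean_xi = sum(p * x[i] for p, _, x in rows)
        cov = sum(p * (act - mean_act) * (x[i] - mean_xi) for p, act, x in rows)
        decoder.append(cov / var_act)
    return decoder


independent = optimal_decoder(0.3, 0.2, 0.2)  # f2 fires regardless of f1
correlated = optimal_decoder(0.3, 0.6, 0.1)   # f2 fires more often with f1
print("independent:", independent)
print("correlated: ", correlated)
```

When the second feature is independent of the first, the decoder comes out as approximately `[1, 0]`, a clean copy of the first feature. When the second feature is more likely to fire alongside the first, the decoder comes out as approximately `[1, 0.5]`: the latent that nominally tracks the first feature now carries part of the second, which is the kind of mixing the abstract attributes to reconstruction loss. A wider SAE with a dedicated latent for the second feature would not need this compromise.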

Code & Models

Repositories

chanind/feature-hedging-paper (PyTorch, official)
