The Rate-Distortion-Polysemanticity Tradeoff in SAEs
Tommaso Mencattini, Francesco Montagna, Francesco Locatello

TL;DR
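This paper characterises an inherent tradeoff in Sparse Autoencoders (SAEs) between reconstruction accuracy (distortion), efficient use of few features (rate), and interpretability (monosemanticity). It shows that the training data distribution influences polysemanticity, and it benchmarks proxy metrics for polysemanticity on language models.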
Contribution
It introduces the Rate-Distortion-Polysemanticity tradeoff in SAEs, combining theoretical analysis and empirical validation, and benchmarks proxy metrics on language models.
Findings
Restricting SAEs to be monosemantic increases rate and distortion.
Polysemanticity depends on feature co-occurrence probabilities in data.
Benchmarks of proxy metrics on large language models indicate that polysemanticity is influenced by the data distribution.
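To make "feature co-occurrence probabilities" concrete, here is a minimal sketch of estimating them from data. It assumes access to a binary matrix that records which ground-truth features are active in each sample, as in a toy setting. The function name and the input format are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cooccurrence_probabilities(active):
    """Estimate pairwise feature co-occurrence probabilities.

    active: (n_samples, n_features) array of 0/1 flags marking which
    features are active in each sample (hypothetical input format).
    Returns an (n_features, n_features) matrix P where P[i, j] is the
    empirical probability that features i and j are active together.
    The diagonal P[i, i] is the marginal activation probability of i.
    """
    active = np.asarray(active, dtype=float)
    n_samples = active.shape[0]
    # (active.T @ active)[i, j] counts samples where both i and j fire.
    return active.T @ active / n_samples

# Toy data: features 0 and 1 fire together in two of four samples,
# and feature 2 never co-occurs with either of them.
toy_active = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
])
P = cooccurrence_probabilities(toy_active)
```

In this toy matrix, `P[0, 1]` is 0.5 and `P[0, 2]` is 0, so features 0 and 2 never appear in the same sample.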
Abstract
Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic representations (highly interpretable), limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability of features to co-occur. Finally, we extend the…
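As an illustration of the first two quantities in the tradeoff, the sketch below computes a distortion and a rate for a simple ReLU SAE. It assumes distortion is the mean squared reconstruction error and uses the average number of active latents per sample (L0) as a stand-in for rate. Both choices, the function name, and the parameter names are assumptions made for illustration; the paper's exact definitions may differ.

```python
import numpy as np

def sae_rate_distortion(x, w_enc, b_enc, w_dec, b_dec):
    """Return (rate, distortion) for a one-layer ReLU SAE.

    x:     (n_samples, d) inputs
    w_enc: (d, m) encoder weights, b_enc: (m,) encoder bias
    w_dec: (m, d) decoder weights, b_dec: (d,) decoder bias
    All names are illustrative, not the paper's notation.
    """
    z = np.maximum(x @ w_enc + b_enc, 0.0)   # sparse latent code
    x_hat = z @ w_dec + b_dec                # reconstruction
    # Distortion: mean squared L2 reconstruction error per sample.
    distortion = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
    # Rate proxy (assumption): mean count of nonzero latents per sample.
    rate = float(np.mean(np.count_nonzero(z, axis=1)))
    return rate, distortion

# Toy check: an identity SAE reconstructs one-hot-like inputs exactly.
x = np.array([[1.0, 0.0], [0.0, 2.0]])
eye, zeros = np.eye(2), np.zeros(2)
rate_id, dist_id = sae_rate_distortion(x, eye, zeros, eye, zeros)

# A dead encoder uses no latents but pays for it in distortion.
rate_dead, dist_dead = sae_rate_distortion(x, np.zeros((2, 2)), zeros, eye, zeros)
```

The two toy cases show the opposite ends of the tradeoff: the identity SAE has zero distortion at a rate of one active latent per sample, while the dead encoder has zero rate and nonzero distortion.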
