A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, Joseph Bloom

TL;DR
This paper investigates the limits of sparse autoencoders (SAEs) in decomposing language model activations into features. It identifies a phenomenon called feature absorption, in which a seemingly monosemantic feature fails to fire where it should because a more specific child feature absorbs it, which undermines interpretability.
Contribution
The paper introduces the concept of feature absorption in SAEs, shows that it is caused by optimizing for sparsity when the underlying features form a hierarchy, and proposes a metric to detect it, highlighting a fundamental challenge for feature decomposition.
Findings
Under feature absorption, seemingly monosemantic features fail to fire where they should and are instead absorbed into their child features.
Varying SAE sizes or sparsity does not mitigate absorption issues.
Empirical validation on hundreds of LLM SAEs confirms the phenomenon.
Abstract
Sparse Autoencoders (SAEs) aim to decompose the activation space of large language models (LLMs) into human-interpretable latent directions or features. As we increase the number of features in the SAE, hierarchical features tend to split into finer features ("math" may split into "algebra", "geometry", etc.), a phenomenon referred to as feature splitting. However, we show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their children features. We coin this phenomenon feature absorption, and show that it is caused by optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. We introduce a metric to detect absorption in SAEs, and validate our findings empirically on hundreds of LLM SAEs. Our investigation…
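The claim that optimizing for sparsity favors absorption when features are hierarchical can be illustrated with a small hand-built example. The sketch below is not the paper's experiment or its detection metric; the two feature directions, the token names, and the `report` helper are assumptions made purely for illustration. It compares two dictionaries that reconstruct the same activation exactly: one keeps parent and child latents separate, the other lets the child latent's decoder absorb the parent direction.

```python
import numpy as np

# Hypothetical toy setup: two orthogonal unit "true feature" directions.
parent = np.array([1.0, 0.0])  # a general feature, e.g. "starts with S"
child = np.array([0.0, 1.0])   # a specific feature, e.g. the token "short"

# The child implies the parent, so the activation for the child token
# contains both directions.
x_child_token = parent + child


def report(decoder, codes, target):
    """Return reconstruction error and sparsity statistics for one input."""
    recon = codes @ decoder
    return {
        "mse": float(np.mean((recon - target) ** 2)),
        "l0": int(np.count_nonzero(codes)),
        "l1": float(np.abs(codes).sum()),
    }


# Dictionary A (no absorption): one latent per true feature.
# Both latents must fire on the child token.
dec_clean = np.stack([parent, child])
codes_clean = np.array([1.0, 1.0])

# Dictionary B (absorption): the child latent's unit-norm decoder also
# carries the parent direction, so the parent latent stays silent here.
dec_absorbed = np.stack([parent, (parent + child) / np.sqrt(2)])
codes_absorbed = np.array([0.0, np.sqrt(2)])

clean = report(dec_clean, codes_clean, x_child_token)
absorbed = report(dec_absorbed, codes_absorbed, x_child_token)

print("clean   :", clean)
print("absorbed:", absorbed)
```

Both dictionaries reconstruct the input with zero error, but the absorbing dictionary uses one active latent instead of two and has a smaller L1 norm (about 1.41 versus 2.0). A sparsity penalty therefore prefers it, even though the parent latent no longer fires on an input where the parent feature is present.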
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
