A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

David Chanin; James Wilken-Smith; Tomáš Dulka; Hardik Bhatnagar; Satvik Golechha; Joseph Bloom

arXiv:2409.14507 · cs.CL · November 18, 2025

TL;DR

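This paper investigates the limits of sparse autoencoders (SAEs) in decomposing language model activations into interpretable features. It identifies feature absorption: when features form a hierarchy, a seemingly monosemantic general feature fails to fire where it should because a more specific child feature absorbs it, which undermines interpretability.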

Contribution

The paper introduces the concept of feature absorption in SAEs, shows that it is caused by optimizing for sparsity when the underlying features form a hierarchy, and proposes a metric to detect it, highlighting a fundamental challenge in feature decomposition.

Findings

01

Feature absorption causes seemingly monosemantic features to fail to fire where they should, with their role absorbed by child features.

02

Varying SAE sizes or sparsity does not mitigate absorption issues.

03

Empirical validation on hundreds of LLM SAEs confirms the phenomenon.

Abstract

Sparse Autoencoders (SAEs) aim to decompose the activation space of large language models (LLMs) into human-interpretable latent directions or features. As we increase the number of features in the SAE, hierarchical features tend to split into finer features ("math" may split into "algebra", "geometry", etc.), a phenomenon referred to as feature splitting. However, we show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their children features. We coin this phenomenon feature absorption, and show that it is caused by optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. We introduce a metric to detect absorption in SAEs, and validate our findings empirically on hundreds of LLM SAEs. Our investigation…
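
The abstract attributes absorption to the sparsity objective acting on hierarchical features. The toy sketch below is one way to see why; it is an illustration under stated assumptions, not the paper's code or metric. The two feature directions, the "starts with S" and "short" labels, and the helper names `add`, `decode`, and `summarize` are all hypothetical. It compares a faithful dictionary with an "absorbed" one in which the child latent's decoder direction also carries the parent direction.

```python
# Toy sketch (hypothetical setup, not the paper's code): two orthogonal "true"
# feature directions in a 2-D activation space.
PARENT = (1.0, 0.0)  # general feature, e.g. "starts with S" (assumed label)
CHILD = (0.0, 1.0)   # specific feature, e.g. the token "short" (assumed label)


def add(u, v):
    """Element-wise sum of two vectors stored as tuples."""
    return tuple(a + b for a, b in zip(u, v))


# Hierarchy: whenever the child feature is active, the parent is active too.
INPUTS = {
    "parent_only": PARENT,
    "parent_and_child": add(PARENT, CHILD),
}


def decode(latents, decoder):
    """Reconstruct an activation as a weighted sum of decoder directions."""
    out = (0.0, 0.0)
    for weight, direction in zip(latents, decoder):
        out = add(out, tuple(weight * d for d in direction))
    return out


def summarize(codes, decoder):
    """Return (max reconstruction error, total count of active latents)."""
    max_err = 0.0
    total_l0 = 0
    for name, latents in codes.items():
        recon = decode(latents, decoder)
        err = max(abs(r - t) for r, t in zip(recon, INPUTS[name]))
        max_err = max(max_err, err)
        total_l0 += sum(1 for z in latents if z != 0.0)
    return max_err, total_l0


# Faithful solution: latent 0 tracks the parent, latent 1 tracks the child,
# so both latents fire when both features are present.
faithful_decoder = [PARENT, CHILD]
faithful_codes = {"parent_only": (1.0, 0.0), "parent_and_child": (1.0, 1.0)}

# Absorbed solution: the child latent's decoder direction includes the parent
# direction, so the parent latent can stay silent when the child is present.
absorbed_decoder = [PARENT, add(PARENT, CHILD)]
absorbed_codes = {"parent_only": (1.0, 0.0), "parent_and_child": (0.0, 1.0)}

faithful_err, faithful_l0 = summarize(faithful_codes, faithful_decoder)
absorbed_err, absorbed_l0 = summarize(absorbed_codes, absorbed_decoder)

print("faithful:", faithful_err, faithful_l0)  # faithful: 0.0 3
print("absorbed:", absorbed_err, absorbed_l0)  # absorbed: 0.0 2
```

Both dictionaries reconstruct the two inputs exactly, but the absorbed one uses fewer active latents, so a sparsity penalty would favor it. The cost is that the parent latent no longer fires on an input where the parent feature is present.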

Code & Models

Repositories

lasr-spelling/sae-spelling (PyTorch, official)

Videos

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders (SlidesLive)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis