Dense SAE Latents Are Features, Not Bugs
Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark

TL;DR
This paper investigates dense latents in sparse autoencoders for language models, revealing they are meaningful, functionally relevant features rather than mere artifacts or noise.
Contribution
It provides a systematic analysis of dense latents, showing that they are intrinsic to the residual stream, carry meaningful signals, and change in character across layers, which challenges the view that they are undesirable training artifacts.
Findings
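Dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream.
Ablating the subspace spanned by dense latents suppresses the emergence of new dense features in retrained SAEs, suggesting that high-density features are an intrinsic property of the residual space.
Dense features relate to position, context, part-of-speech (POS), and output signals, and their makeup evolves through the model's layers.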
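The first finding can be illustrated with a minimal sketch, not the authors' code. It computes each latent's activation density (the fraction of tokens on which it fires) and then searches the dense latents for pairs whose direction vectors point in nearly opposite directions. The function names, the toy data, the 0.5 density cutoff, and the -0.9 cosine threshold are all illustrative assumptions. The summary does not say which SAE weights (encoder or decoder) define a latent's direction, so `directions` is just a stand-in.

```python
import math

def activation_density(activations):
    """Fraction of tokens on which each latent is active (activation > 0).

    activations: list of rows, one row per token, one column per latent.
    """
    n_tokens = len(activations)
    n_latents = len(activations[0])
    return [
        sum(1 for row in activations if row[j] > 0) / n_tokens
        for j in range(n_latents)
    ]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def antipodal_pairs(directions, latent_ids, threshold=-0.9):
    """Pairs of latents whose directions have cosine similarity <= threshold."""
    pairs = []
    for pos, i in enumerate(latent_ids):
        for j in latent_ids[pos + 1:]:
            if cosine(directions[i], directions[j]) <= threshold:
                pairs.append((i, j))
    return pairs

# Toy data (hypothetical): 4 tokens, 3 latents, 2-dimensional directions.
activations = [
    [0.5, 0.0, 0.0],
    [0.0, 0.7, 0.0],
    [0.2, 0.0, 0.3],
    [0.0, 0.4, 0.0],
]
directions = [[1.0, 0.0], [-1.0, 0.05], [0.0, 1.0]]

density = activation_density(activations)
# Illustrative cutoff for "dense"; the threshold used in the paper is not given here.
dense_ids = [j for j, d in enumerate(density) if d >= 0.5]
pairs = antipodal_pairs(directions, dense_ids)
```

In this toy setup latents 0 and 1 are each active on half of the tokens and point in nearly opposite directions, so they come out as the single antipodal pair. Latent 2 is too sparse to be considered.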
Abstract
Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are dense), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs, suggesting that high-density features are an intrinsic property of the residual space. We then…
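One plausible reading of "ablating their subspace" is to project residual-stream vectors onto the orthogonal complement of the span of the dense latent directions before retraining an SAE on the result. The sketch below shows only that projection step on toy vectors. It is an assumption about the mechanics rather than the paper's exact procedure, and the helper names `orthonormalize` and `ablate_subspace` are hypothetical.

```python
import math

def orthonormalize(vectors, eps=1e-8):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`.

    Vectors that are (numerically) linear combinations of earlier ones,
    such as the second member of an exactly antipodal pair, are skipped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis

def ablate_subspace(x, basis):
    """Remove from x its component inside the span of an orthonormal basis."""
    out = list(x)
    for b in basis:
        coeff = sum(oi * bi for oi, bi in zip(out, b))
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out

# Toy example (hypothetical): an antipodal pair spans a single direction.
dense_directions = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
basis = orthonormalize(dense_directions)

residual = [2.0, 3.0, -1.0]
ablated = ablate_subspace(residual, basis)
```

Because the two toy directions are antipodal, they contribute one basis vector, and the ablation zeroes only the first coordinate of `residual`. In the experiment described in the abstract, a new SAE would then be trained on such ablated activations to see whether dense latents reappear.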
Taxonomy
Topics: Natural Language Processing Techniques
