TL;DR
SoftSAE introduces a differentiable, input-dependent Top-K mechanism for sparse autoencoders, enabling adaptive feature selection that improves interpretability and representation quality in neural networks.
Contribution
The paper proposes a Soft Top-K operator that lets a sparse autoencoder adjust its sparsity level for each input, with the aim of improving interpretability and feature relevance.
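To make the idea of a differentiable, input-dependent Top-K concrete, here is a minimal sketch of one common relaxation. It is not taken from the SoftSAE code and may differ from the paper's operator. Each score gets a sigmoid gate, and a shared threshold is found by bisection so that the gates sum to a target `k`, which can be a real number and could in principle be predicted per input. The names `soft_top_k`, `temperature`, and `iters` are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_top_k(scores, k, temperature=0.1, iters=60):
    """Return gates in (0, 1) whose sum is approximately k.

    Assumption: this sigmoid-threshold relaxation is an illustrative
    stand-in, not the operator defined in the SoftSAE paper.
    """
    n = len(scores)
    if not 0 < k < n:
        raise ValueError("k must lie strictly between 0 and len(scores)")

    def gate_sum(threshold):
        return sum(sigmoid((s - threshold) / temperature) for s in scores)

    # gate_sum is decreasing in the threshold, so bisection applies.
    lo = min(scores) - 40 * temperature  # gates are all close to 1 here
    hi = max(scores) + 40 * temperature  # gates are all close to 0 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gate_sum(mid) > k:
            lo = mid  # too many features active: raise the threshold
        else:
            hi = mid
    threshold = 0.5 * (lo + hi)
    return [sigmoid((s - threshold) / temperature) for s in scores]


scores = [2.0, 0.1, 1.5, -0.3, 0.05]     # made-up feature pre-activations
gates_2 = soft_top_k(scores, k=2.0)      # roughly two features switched on
gates_35 = soft_top_k(scores, k=3.5)     # a fractional budget is allowed
```

Multiplying the pre-activations by these gates gives a soft analogue of hard Top-K selection. Lowering `temperature` pushes the gates toward 0 or 1, while a higher value keeps the selection smoother.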
Findings
The authors report that SoftSAE learns meaningful features.
They also report that adaptive sparsity improves data representation.
Code available at https://github.com/St0pien/SoftSAE.
Abstract
Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive…
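The fixed-sparsity limitation described in the abstract can be illustrated with a small sketch of a hard TopK activation. This is a generic illustration rather than code from any particular SAE library. The function name `top_k_activation` and the example values are hypothetical.

```python
def top_k_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


# Made-up pre-activations: one input dominated by a single feature,
# and one input where many features carry comparable weight.
simple_input = [3.0, 0.02, 0.01, 0.03, 0.01]
complex_input = [1.2, 1.1, 0.9, 1.0, 0.8]

K = 3  # the same budget is applied to every input
simple_out = top_k_activation(simple_input, K)
complex_out = top_k_activation(complex_input, K)

# Both outputs have exactly K active features, regardless of input complexity.
active_simple = sum(1 for v in simple_out if v != 0.0)
active_complex = sum(1 for v in complex_out if v != 0.0)
```

With a fixed `K`, the simple input is forced to keep two near-zero features, while the complex input loses two features of comparable magnitude to the ones it keeps.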
