TL;DR
This paper introduces CB-SAE, a post-hoc framework that improves the interpretability and steerability of sparse autoencoders (SAEs) in large vision-language models (LVLMs) by pruning low-utility neurons and augmenting the latent space with a concept bottleneck.
Contribution
It proposes a novel post-hoc method that improves the interpretability and steerability of SAEs through neuron pruning and concept-bottleneck augmentation.
Findings
CB-SAE improves interpretability by 32.1%.
CB-SAE enhances steerability by 14.5%.
The method is effective across LVLMs and image generation tasks.
Abstract
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learned features to be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics for a systematic analysis of LVLM SAEs. This uncovers two observations: (i) a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and (ii) user-desired concepts are often absent from the SAE, limiting its practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a…
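The prune-then-augment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-neuron scores, the thresholds, and the function name `prune_and_augment` are hypothetical stand-ins, and the paper's actual metrics and concept-bottleneck training are not reproduced here. The sketch only shows the shape of the operation, assuming each SAE neuron already has an interpretability score and a steerability score.

```python
import numpy as np

def prune_and_augment(sae_latents, interp_scores, steer_scores,
                      concept_latents, interp_min=0.5, steer_min=0.5):
    """Drop low-utility SAE neurons, then append concept-bottleneck units.

    All names and thresholds are illustrative assumptions, not the paper's API.
    sae_latents:     (batch, n_neurons) SAE activations
    interp_scores:   (n_neurons,) hypothetical interpretability score per neuron
    steer_scores:    (n_neurons,) hypothetical steerability score per neuron
    concept_latents: (batch, n_concepts) activations of user-defined concepts
    """
    # A neuron is kept only if it passes BOTH thresholds.
    keep = (interp_scores >= interp_min) & (steer_scores >= steer_min)
    pruned = sae_latents[:, keep]
    # Augment the surviving latents with the concept-bottleneck activations.
    combined = np.concatenate([pruned, concept_latents], axis=1)
    return combined, keep

# Toy example: 2 samples, 4 SAE neurons, 1 user-defined concept.
sae_latents = np.array([[1.0, 0.0, 2.0, 0.0],
                        [0.0, 3.0, 0.0, 4.0]])
interp_scores = np.array([0.9, 0.2, 0.8, 0.7])
steer_scores = np.array([0.6, 0.9, 0.1, 0.8])
concept_latents = np.array([[0.5], [0.25]])

combined, keep = prune_and_augment(
    sae_latents, interp_scores, steer_scores, concept_latents
)
print(keep)      # which SAE neurons survived pruning
print(combined)  # pruned SAE latents followed by concept activations
```

In this toy setup, neurons that fail either threshold are removed, so the combined latent space contains only the surviving SAE neurons plus the appended concept units.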
