Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

Akshay Kulkarni; Tsui-Wei Weng; Vivek Narayanaswamy; Shusen Liu; Wesam A. Sakla; Kowshik Thopalli

arXiv:2512.10805·cs.LG·April 1, 2026

TL;DR

This paper introduces CB-SAE, a post-hoc framework that improves the interpretability and steerability of sparse autoencoders for large vision-language models by pruning low-utility neurons and augmenting the latent space with a concept bottleneck.

Contribution

It proposes a novel post-hoc method to improve interpretability and steerability of SAEs through neuron pruning and concept bottleneck augmentation.

Findings

01. CB-SAE improves interpretability by 32.1%.

02. CB-SAE enhances steerability by 14.5%.

03. The method is effective across LVLMs and image generation tasks.

Abstract

Sparse autoencoders (SAEs) promise a unified approach to mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. Realizing this potential, however, requires the learned features to be both interpretable and steerable. To that end, we introduce two new computationally inexpensive metrics, one for interpretability and one for steerability, and use them for a systematic analysis of LVLM SAEs. The analysis yields two observations: (i) a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and (ii) user-desired concepts are often absent from the SAE, which limits its practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a…
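
The prune-then-augment idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated and does not say how neurons are scored, how the bottleneck is trained, or how the two parts are combined. The function name `prune_and_augment`, the per-neuron score lists, the 0.5 thresholds, and the choice to concatenate concept activations onto the surviving latents are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def prune_and_augment(
    sae_latents: Sequence[Sequence[float]],
    interp_scores: Sequence[float],
    steer_scores: Sequence[float],
    concept_acts: Sequence[Sequence[float]],
    interp_thresh: float = 0.5,  # hypothetical cutoff, not from the paper
    steer_thresh: float = 0.5,   # hypothetical cutoff, not from the paper
) -> Tuple[List[List[float]], List[bool]]:
    """Drop low-utility SAE neurons, then append concept activations.

    A neuron is kept only if BOTH of its scores clear their thresholds
    (an assumed rule). Each sample's surviving latents are concatenated
    with that sample's concept-bottleneck activations.
    """
    keep = [
        i >= interp_thresh and s >= steer_thresh
        for i, s in zip(interp_scores, steer_scores)
    ]
    augmented = []
    for row, concepts in zip(sae_latents, concept_acts):
        kept = [value for value, k in zip(row, keep) if k]
        augmented.append(kept + list(concepts))
    return augmented, keep


# Toy data: 2 samples, 6 SAE neurons, 2 user-defined concepts.
sae_latents = [
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
]
interp_scores = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]
steer_scores = [0.8, 0.9, 0.3, 0.6, 0.2, 0.7]
concept_acts = [[0.5, 0.0], [0.0, 0.25]]

augmented, keep = prune_and_augment(
    sae_latents, interp_scores, steer_scores, concept_acts
)
print(keep)       # which of the 6 neurons survive pruning
print(augmented)  # surviving latents followed by concept activations
```

In this toy setup, neurons 0, 3, and 5 survive because they are the only ones with both scores at or above 0.5, so each sample ends up with three SAE latents followed by two concept activations.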

Code & Models

Models

🤗
ryanyen22/reason-first-program
model
