Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman

TL;DR
This paper introduces codebook features, a method to make neural network hidden states sparse and discrete through vector quantization, enhancing interpretability and controllability with minimal performance loss.
Contribution
The authors propose a novel quantization-based approach to produce sparse, discrete hidden states in neural networks, improving interpretability and enabling behavior control.
Findings
Neural networks can operate with sparse, discrete hidden states with modest performance degradation.
Codebook features allow for explicit control of network behavior by activating specific codes.
The approach successfully disentangles concepts in language models, enabling topic guidance during inference.
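The control finding above can be illustrated with a small sketch. It is not the authors' procedure: the helper names `find_behavior_codes` and `activate_codes`, the frequency-difference scoring, and the toy identity codebook are all assumptions made for illustration. The idea is to rank codes by how much more often they fire on inputs that show a behavior, then force those codes on at inference.

```python
from collections import Counter

import numpy as np


def find_behavior_codes(with_behavior, without_behavior, top_n):
    """Rank code ids by how much more often they fire when the behavior is present.

    Each argument is a list of sets; each set holds the code ids that fired
    on one example (hypothetical data layout).
    """
    pos = Counter(c for codes in with_behavior for c in codes)
    neg = Counter(c for codes in without_behavior for c in codes)
    # Difference in per-example firing rate (an assumed scoring rule).
    score = {
        c: pos[c] / len(with_behavior) - neg[c] / len(without_behavior)
        for c in pos
    }
    return sorted(score, key=score.get, reverse=True)[:top_n]


def activate_codes(chosen, forced, codebook):
    """Swap forced codes into a layer's chosen ids and rebuild the hidden state."""
    n_keep = max(len(chosen) - len(forced), 0)
    kept = [c for c in chosen if c not in forced][:n_keep]
    ids = kept + list(forced)
    # The hidden state is the sum of the selected code vectors.
    return codebook[ids].sum(axis=0), ids


# Toy data: code 5 fires on every example that shows the behavior.
with_behavior = [{1, 5}, {5, 7}, {5, 2}]
without_behavior = [{1, 2}, {7, 2}]
behavior_codes = find_behavior_codes(with_behavior, without_behavior, top_n=1)

codebook = np.eye(8)  # hypothetical codebook: 8 codes of dimension 8
steered_hidden, steered_ids = activate_codes([3, 0], behavior_codes, codebook)
```

In this toy setup the layer keeps one of its original codes and gives the remaining slot to the behavior code, so the number of active codes stays fixed.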
Abstract
Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during…
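A minimal sketch of the bottleneck the abstract describes, covering the forward pass only: a continuous hidden vector is replaced by the sum of a few codes chosen from a larger codebook. The function name `codebook_quantize`, the use of cosine similarity to pick codes, and the toy values are assumptions; the abstract does not specify the selection rule, and finetuning through the bottleneck is not shown.

```python
import numpy as np


def codebook_quantize(hidden, codebook, k):
    """Replace `hidden` with the sum of its k most similar codebook vectors.

    hidden:   (d,) continuous activation
    codebook: (n_codes, d) code vectors
    Returns the quantized vector and the indices of the chosen codes.
    """
    # Cosine similarity between the activation and every code
    # (assumed similarity measure, not confirmed by the source).
    h = hidden / (np.linalg.norm(hidden) + 1e-8)
    c = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-8)
    sims = c @ h
    # Indices of the k best-matching codes, highest similarity first.
    top = np.argsort(-sims)[:k]
    return codebook[top].sum(axis=0), top


# Toy codebook with four codes of dimension three (illustrative values).
codebook = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
hidden = np.array([0.9, 0.8, 0.1])
quantized, codes = codebook_quantize(hidden, codebook, k=2)
```

The returned indices are the discrete, sparse description of the activation: only `k` of the codebook's entries are active for any given input.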
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization
