Codebook Features: Sparse and Discrete Interpretability for Neural Networks

Alex Tamkin; Mohammad Taufeeque; Noah D. Goodman

arXiv:2310.17230 · cs.LG · October 27, 2023 · 2 citations

TL;DR

This paper introduces codebook features, a method that uses vector quantization to make neural network hidden states sparse and discrete, improving interpretability and controllability with only modest performance degradation.

Contribution

The authors propose a novel quantization-based approach to produce sparse, discrete hidden states in neural networks, improving interpretability and enabling behavior control.

Findings

01. Neural networks can operate with sparse, discrete hidden states with modest performance degradation.

02. Codebook features allow for explicit control of network behavior by activating specific codes.

03. The approach successfully disentangles concepts in language models, enabling topic guidance during inference.
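The control idea in findings 02 and 03 can be illustrated with a small sketch. This page does not say how the codes tied to a behavior are identified or how they are forced on, so everything below is an assumption: the frequency-difference score, the function names `find_behavior_codes` and `steer`, and the toy code indices are all hypothetical, not the authors' procedure.

```python
from collections import Counter

def find_behavior_codes(activations_with, activations_without, top_n):
    """Rank codes by how much more often they fire when the behavior is present.

    Each argument is a list of code-index lists, one list per token position.
    The scoring rule (difference of firing rates) is an illustrative assumption.
    """
    with_counts = Counter(c for codes in activations_with for c in codes)
    without_counts = Counter(c for codes in activations_without for c in codes)
    n_with = max(len(activations_with), 1)
    n_without = max(len(activations_without), 1)
    score = {c: with_counts[c] / n_with - without_counts[c] / n_without
             for c in with_counts}
    # Highest score first; ties broken by code index for determinism.
    return sorted(score, key=lambda c: (-score[c], c))[:top_n]

def steer(chosen_codes, forced_codes):
    """Put the forced codes first and keep the number of active codes unchanged."""
    k = len(chosen_codes)
    kept = [c for c in chosen_codes if c not in forced_codes]
    return (list(forced_codes) + kept)[:k]

# Toy data: active code indices per position, with and without the target topic.
with_topic = [[3, 7], [3, 5], [3, 7]]
without_topic = [[1, 5], [2, 5], [1, 7]]

topic_codes = find_behavior_codes(with_topic, without_topic, top_n=2)  # [3, 7]
steered = steer([1, 5], topic_codes[:1])                               # [3, 1]
```

In a real model the steered code indices would be mapped back to code vectors and summed to form the hidden state at that layer; that step is omitted here.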

Abstract

Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during…
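The bottleneck described in the abstract, where a hidden state is replaced by the sum of a few codes drawn from a larger codebook, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the abstract does not state the similarity measure, so cosine similarity is an assumption here, and the function names and toy values are hypothetical. Training (finetuning through the bottleneck) is not covered.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def quantize(hidden, codebook, k):
    """Replace `hidden` with the sum of its k most similar codes.

    Returns (quantized_vector, sorted list of chosen code indices).
    Cosine similarity is an assumed choice of similarity measure.
    """
    ranked = sorted(range(len(codebook)),
                    key=lambda i: cosine(hidden, codebook[i]),
                    reverse=True)
    chosen = sorted(ranked[:k])
    dim = len(hidden)
    quantized = [sum(codebook[i][d] for i in chosen) for d in range(dim)]
    return quantized, chosen

# Toy codebook of 4 codes in 3 dimensions (made-up values).
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
hidden = [0.9, 0.4, -0.1]
quantized, codes = quantize(hidden, codebook, k=2)
# codes == [0, 1]; quantized == [1.0, 1.0, 0.0]
```

The list of chosen indices is the sparse, discrete description of the hidden state: downstream layers see only the summed code vectors, so the continuous input is discarded after the lookup.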

Code & Models

Repositories

taufeeque9/codebook-features (PyTorch, official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization