Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt

TL;DR
This paper introduces Predictive Concept Decoders, a scalable end-to-end interpretability method that predicts model behavior from internal activations through a compressed concept representation. Reported results include better auto-interp scores and downstream uses such as jailbreak detection.
Contribution
It proposes a novel training framework for interpretability assistants that encode activations into concepts and answer questions, enabling scalable and effective model explanations.
Findings
Improved auto-interp scores with more data
Effective detection of jailbreaks and secret hints
Accurate surfacing of latent user attributes
Abstract
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder,…
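The communication bottleneck described in the abstract can be illustrated with a minimal sketch. It assumes a linear encoder and a top-k rule for sparsity; the abstract only says the encoder produces a "sparse list of concepts", so both choices, along with the names `encode_top_k` and `concept_directions`, are hypothetical and not taken from the paper. The decoder, which in the paper answers natural language questions, is not implemented here; the point is only that it would see the short list and nothing else.

```python
from typing import List, Sequence, Tuple


def encode_top_k(
    activation: Sequence[float],
    concept_directions: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Compress one activation vector into a sparse list of concepts.

    Assumption (not from the paper): each concept is a direction in
    activation space, its score is a dot product, and sparsity is
    enforced by keeping only the k highest-scoring concepts.
    """
    # Score every concept by projecting the activation onto its direction.
    scores = [
        sum(a * w for a, w in zip(activation, direction))
        for direction in concept_directions
    ]
    # Rank concept indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy example: a 3-dimensional activation and 4 hypothetical concepts.
activation = [1.0, 0.0, 2.0]
concept_directions = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]

# Only this short list would cross the bottleneck to the decoder.
sparse_concepts = encode_top_k(activation, concept_directions, k=2)
print(sparse_concepts)
```

Because the decoder in this setup can only condition on `sparse_concepts`, any behavior it predicts correctly has to be explained by the few concepts that survived the bottleneck, which is what makes the list a candidate explanation.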
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications
