Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

Vincent Huang; Dami Choi; Daniel D. Johnson; Sarah Schwettmann; Jacob Steinhardt

arXiv:2512.15712 · cs.AI · December 18, 2025

TL;DR

This paper introduces Predictive Concept Decoders, a scalable end-to-end interpretability method that predicts model behavior from internal activations through a sparse concept bottleneck. Auto-interp scores improve with more training data, and the method is effective on downstream tasks such as detecting jailbreaks and secret hints and surfacing latent user attributes.

Contribution

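The paper proposes a training framework for interpretability assistants: an encoder compresses activations into a sparse list of concepts, and a decoder reads that list to answer natural language questions about the model. Training is end-to-end, replacing hand-designed hypothesis-testing agents.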

Findings

01

Improved auto-interp scores with more data

02

Effective detection of jailbreaks and secret hints

03

Accurate surfacing of latent user attributes

Abstract

Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder,…
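The encoder/decoder split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `encode_top_k` and `decoder_input`, the use of top-k selection to obtain sparsity, the dot-product scoring, and the `<c{index}:{score}>` token format are all assumptions made for illustration. The only point it is meant to show is that the decoder sees the sparse concept list and the question, never the raw activation.

```python
import random


def encode_top_k(activation, concept_directions, k):
    """Score each concept direction against the activation and keep the k largest.

    Returns a sparse list of (concept_index, score) pairs, highest score first.
    Top-k selection is an assumption of this sketch, not a detail from the abstract.
    """
    scores = [
        sum(a * w for a, w in zip(activation, direction))
        for direction in concept_directions
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


def decoder_input(sparse_concepts, question):
    """Build what a decoder would read: only the concept list and the question.

    The raw activation is deliberately not passed through; that is the bottleneck.
    The token format is hypothetical.
    """
    concept_tokens = " ".join(f"<c{i}:{s:.2f}>" for i, s in sparse_concepts)
    return f"{concept_tokens} Q: {question}"


# Toy data: a 4-dimensional activation and 6 hypothetical concept directions.
rng = random.Random(0)
concept_directions = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
activation = [0.5, -1.0, 0.25, 2.0]

sparse = encode_top_k(activation, concept_directions, k=2)
print(decoder_input(sparse, "Is the model following a hidden hint?"))
```

In a trained system the concept directions and the decoder would be learned jointly so that the decoder's answers predict model behavior; here they are random placeholders.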

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications