Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks

TL;DR
This paper introduces Activation Oracles (AOs): LLMs trained to answer natural-language questions about LLM activations. They generalize to far out-of-distribution settings and match or outperform white-box baselines on multiple tasks.
Contribution
It presents a generalist approach to understanding LLM activations: LatentQA-trained models, called Activation Oracles, that remain effective in far out-of-distribution settings.
Findings
Activation Oracles (AOs) recover information that was fine-tuned into a model but does not appear in the input text.
AO performance improves as the training data becomes more diverse.
AO models match or outperform white-box baselines on multiple tasks.
Abstract
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned…
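To make the LatentQA idea of "accepting activations as inputs" concrete, here is a minimal, hypothetical sketch. The abstract does not say how activations are fed to the oracle, so the mechanism shown (overwriting the embedding at a placeholder position with an activation vector of the same width) is an assumption for illustration only, and the name `inject_activation` is invented here, not taken from the paper's code.

```python
from typing import List

Vector = List[float]


def inject_activation(
    prompt_embeddings: List[Vector],
    placeholder_index: int,
    activation: Vector,
) -> List[Vector]:
    """Return a copy of the prompt embeddings with one slot replaced.

    Assumption (not from the paper): the oracle's prompt reserves a
    placeholder position whose embedding is overwritten by the activation.
    """
    if not 0 <= placeholder_index < len(prompt_embeddings):
        raise IndexError("placeholder_index is outside the prompt")
    if len(activation) != len(prompt_embeddings[placeholder_index]):
        raise ValueError("activation width must match the embedding width")

    # Copy every vector so the caller's embeddings are left untouched.
    patched = [list(vector) for vector in prompt_embeddings]
    patched[placeholder_index] = list(activation)
    return patched


# Toy 2-dimensional embeddings for a prompt such as "<placeholder> What is this?"
toy_prompt = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
toy_activation = [0.5, -0.5]
patched_prompt = inject_activation(toy_prompt, 0, toy_activation)
```

In a real model the patched sequence would be passed on to the transformer layers in place of the ordinary token embeddings, and the question text would follow the placeholder; those steps are omitted here.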
Code & Models
- 🤗 ceselder/neural-chameleon-gemma-2-9b-it (model, 6 downloads)
- 🤗 ceselder/neural-chameleon-gemma-3-27b-it (model, 2 downloads)
- 🤗 eekay/neural-chameleon-10concepts (model, 1 download)
- 🤗 eekay/neural-chameleon-10concepts-gemma-2-9b-it (model, 2 downloads)
- 🤗 ceselder/activation-oracle-sft-epistemic (model, 1 download)
- 🤗 ceselder/qwen3-8b-cot-oracle (model)
- 🤗 ceselder/cot-oracle-v4-checkpoints (model)
- 🤗 ceselder/cot-oracle-ablation-stride5-3layers (model, 11 downloads)
- 🤗 Spiritual4/activation-oracle-gemma-3-12b-it (model, 70 downloads)
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Artificial Intelligence in Healthcare and Education
