LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Alexander Pan, Lijie Chen, Jacob Steinhardt

TL;DR
LatentQA introduces a natural language decoding probe for language models, enabling detailed interpretability and control of activations, outperforming existing methods and scaling effectively with larger datasets and models.
Contribution
We develop a novel natural language decoding probe for language models, creating a dataset of activations and question-answer pairs, and demonstrate its effectiveness in interpretability and control tasks.
Findings
Outperforms existing probing baselines in reading tasks
Can steer models to exhibit unseen behaviors
Scales well with larger datasets and models
Abstract
Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language, performing LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder's fidelity by assessing its ability to read and control model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system…
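The abstract describes a dataset that pairs activations with question-answer pairs, and the reviews below note that activations are patched from the target LLM's layer k into the decoder LLM's layer ℓ without a formal definition of the patch operation. The following is a minimal sketch of that data flow, assuming the patch is a plain replacement of hidden states at placeholder positions. That choice is an assumption made here for illustration and is not confirmed by the source. The names `LatentQAExample` and `patch_by_replacement`, the toy vectors, and the sample question and answer are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class LatentQAExample:
    """One hypothetical training example: activations plus a question-answer pair."""
    activations: List[Vector]  # one vector per token, captured at target layer k
    question: str              # open-ended question about the activations
    answer: str                # natural-language answer the decoder should produce


def patch_by_replacement(decoder_hidden: List[Vector],
                         activations: List[Vector],
                         start: int = 0) -> List[Vector]:
    """Return a copy of decoder_hidden (layer l) with the positions
    start .. start + len(activations) - 1 overwritten by the activations.

    Replacement is only one candidate patch operation (assumed here);
    addition or a learned linear map would be alternatives.
    """
    if start < 0 or start + len(activations) > len(decoder_hidden):
        raise ValueError("activations do not fit in the decoder sequence")
    patched = [list(vec) for vec in decoder_hidden]  # copy; input stays untouched
    for offset, vec in enumerate(activations):
        if len(vec) != len(patched[start + offset]):
            raise ValueError("hidden sizes of target and decoder differ")
        patched[start + offset] = list(vec)
    return patched


# Toy values, invented for illustration only.
example = LatentQAExample(
    activations=[[0.5, -1.0], [2.0, 0.25]],
    question="What is the model's persona?",
    answer="A cheerful assistant.",
)

# Toy decoder hidden states: two placeholder positions, then two question-token positions.
decoder_hidden = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
patched = patch_by_replacement(decoder_hidden, example.activations)
```

In this sketch the decoder would then be fine-tuned to produce `example.answer` given the patched hidden states and `example.question`. Replacement requires the target and decoder hidden sizes to match, which is why the function checks vector lengths.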
Peer Reviews
Decision: ICLR 2026 Poster
- The paper addresses a significant limitation of current probe models by training decoders that can return rich, open-ended natural language answers based on LLM activations. This extends the scope of interpretability to capture nuanced behaviors and model states that scalar or single-token probes cannot express.
- The work presents a detailed procedure for dataset curation, including control and stimulus prompts, masking strategies, and data augmentation, resulting in a comprehensive training
1. **Insufficient mathematical formalization of core mechanisms:** The paper presents the core ideas clearly at a conceptual level but lacks rigorous mathematical exposition in several critical areas. The patching mechanism for transferring activations from the target LLM's layer k to the decoder LLM's layer ℓ is described operationally but not formally defined. There is no explicit mathematical specification of the patch operation—whether it involves replacement, addition, linear transformation
The paper is well written and illustrated. Evaluation is performed on multiple tasks and demonstrates the effectiveness of the method, at least for the tested Llama model. The impact of the steering method might be significant, as it allows for targeted modification of the model's behaviour without training on large datasets of demonstrations.
The main weakness of the paper is the lack of evaluation on different models. It shows the effectiveness of the method on the selected Llama model, but it does not tell whether the method generalizes to other settings. It would be beneficial to see results for different model families, for example Qwen3 models, or steering capability on the gpt-oss model, which is known for strong safety/alignment training. Moreover:
- For scaling experiments, the authors only show the test loss values. It does n
1. **Novelty of the Task**: The core concept of LATENTQA is ambitious and novel. It moves transparency research beyond simple, low-bandwidth probes (linear, scalar) towards a much more expressive, high-bandwidth framework. The idea of "captioning" or "interrogating" a model's internal state using natural language is a compelling research direction. Treating activations themselves as a modality for QA may provide a foundation for further interpretation and intervention.
2. **Creative Experimental Design**:
1. **Unfair/Incomplete Baseline Comparisons**: This is the most significant weakness. The LIT method involves fine-tuning a decoder model, yet it is primarily compared against untrained methods (SelfIE, Patchscope) or much simpler trained linear probes. It is not surprising that a fine-tuned LLM decoder outperforms these methods.
- For Control: A more convincing comparison would be against standard behavioral fine-tuning (SFT) or preference-tuning methods (DPO/RLHF) using the same (prompt, com
Taxonomy
Topics: Natural Language Processing Techniques · Artificial Intelligence in Law · Topic Modeling
