ParaScopes: What do Language Models Activations Encode About Future Text?

Nicky Pochinkov; Yulia Volkova; Anna Vasileva; Sai V R Chereddy

arXiv:2511.00180·cs.CL·November 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Residual Stream Decoders, a framework for probing language model activations for paragraph-scale and document-scale plans. In small models, the information decoded is equivalent to 5+ tokens of future context.

Contribution

The paper presents a novel framework for decoding long-term planning information from language model activations, extending interpretability to larger context horizons.

Findings

01

Information equivalent to 5+ tokens of future context can be decoded.

02

Decoding methods are effective across different model sizes.

03

Framework enables better understanding of long-term planning in language models.

Abstract

Interpretability studies in language models often investigate forward-looking representations of activations. However, as language models become capable of doing ever longer time horizon tasks, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find information can be decoded equivalent to 5+ tokens of future context in small models. These results lay the groundwork for better monitoring of language models and better understanding how they might encode longer-term planning information.
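As a rough illustration of what probing activations with a decoder can look like, the sketch below fits a linear map from activation vectors to embeddings of upcoming text, then checks whether predictions match the correct targets better than mismatched ones. This is not the authors' implementation: the data is synthetic, the ridge-regression fit is an assumed stand-in, and every name (`fit_linear_decoder`, `mean_cosine`, `d_model`, `d_embed`) is hypothetical.

```python
import numpy as np


def fit_linear_decoder(acts, targets, ridge=1e-3):
    """Fit a ridge-regularized linear map (plus bias) from acts to targets."""
    # Append a constant column so the bias is learned jointly with the weights.
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    # Closed-form ridge solution; the penalty also applies to the bias row.
    A = X.T @ X + ridge * np.eye(X.shape[1])
    Wb = np.linalg.solve(A, X.T @ targets)
    return Wb[:-1], Wb[-1]


def mean_cosine(pred, target):
    """Mean row-wise cosine similarity between two matrices."""
    num = np.sum(pred * target, axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1)
    return float(np.mean(num / den))


# Synthetic stand-ins (assumption): "activations" and "future-text embeddings"
# related by a noisy linear map. Real data would come from a model's residual
# stream and from a text encoder applied to the upcoming paragraph.
rng = np.random.default_rng(0)
n, d_model, d_embed = 400, 32, 8
acts = rng.normal(size=(n, d_model))
true_map = rng.normal(size=(d_model, d_embed))
targets = acts @ true_map + 0.1 * rng.normal(size=(n, d_embed))

# Fit on the first 300 examples, evaluate on the remaining 100.
W, b = fit_linear_decoder(acts[:300], targets[:300])
pred = acts[300:] @ W + b

score = mean_cosine(pred, targets[300:])
# Baseline: pair each prediction with a different example's target.
baseline = mean_cosine(pred, targets[300:][::-1])
print(f"matched: {score:.3f}  mismatched: {baseline:.3f}")
```

Comparing against mismatched pairs is one simple way to check that a probe recovers example-specific information. It does not by itself show that the model stores or uses a plan.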

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is well-written with a clear structure. Both the methodology and experimental setup are presented in a straightforward, easy-to-follow manner.
- The authors offer a valuable methodological advance by operationalizing planning as the decodability of upcoming text from residual stream activations.

Weaknesses

- The paper equates the ability to decode future tokens or paragraphs from residual stream activations with evidence of "planning." This interpretation is not convincing to me. The residual stream contains contextual information that makes certain continuations more likely, even if no explicit future information is stored. For example, if the context describes the first half of a soccer match, the model predicting the second half next is just a result of coherence, not evidence of a stored "plan

Reviewer 02 · Rating 0 · Confidence 3

Strengths

* The layer-wise and temporal dynamics studies in sections 6.1 and 6.2 provide some interesting analysis of language model hidden states
* Method diagrams are a helpful aid to the writing

Weaknesses

* The experimental setup for the main experimental results (in Figure 4) is problematic. They use data generated by the target model to then evaluate the target model. Obviously, text sampled from the target model is likely under the target model. This introduces a major source of confounding. I expect a language model would generate the same or similar next couple of tokens given only its last hidden state. Not necessarily because it has planned them out, but possibly because it's just the likely next thing

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper introduces two distinct and complementary probing methods, the intervention-based Continuation ParaScope and the mapping-based TAE ParaScope, which strengthens the validity of its conclusions.
- The study goes beyond simply finding evidence of planning by localizing the relevant signals, identifying the middle layers of the network as the primary location for paragraph-level planning information.
- The paper provides a specific temporal account of planning, presenting evidence for a

Weaknesses

- The findings are based almost entirely on a single, relatively small (3B parameter) model, making it unclear if they apply to larger, more capable architectures.
- The main TAE ParaScope method uses a linear map, which may be too simple to extract more complex, non-linearly encoded plans from the model's activations.
- The work primarily establishes that future-looking information is present in activations, but provides limited evidence that this information is causally used by the model to gu

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · AI-based Problem Solving and Planning · Multimodal Machine Learning Applications