The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Xiaoyuan Liu; Tian Liang; Dongyang Ma; Deyu Zhou; Haitao Mi; Pinjia He; Yan Wang

arXiv:2602.12108·cs.AI·February 13, 2026

TL;DR

This paper introduces StateLM, a new class of language models capable of actively managing their own context and memory tools, leading to significant improvements in long-document understanding and reasoning tasks.

Contribution

We propose StateLM, a model with an internal reasoning loop and memory management tools, enabling dynamic context engineering and surpassing standard LLM performance.

Findings

01

StateLM outperforms standard LLMs on long-document QA tasks.

02

StateLM achieves 10-20% accuracy improvements on chat memory tasks.

03

StateLM reaches up to 52% accuracy on BrowseComp-Plus, compared to 5% for standard models.

Abstract

In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes…
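To make the three memory tools named in the abstract more concrete, here is a minimal Python sketch of what context pruning, document indexing, and note-taking could look like around a bounded working context. Everything in it is an assumption for illustration: `ContextState`, `build_index`, `run_loop`, the word-count budget, and the keyword rule are hypothetical names and choices, not StateLM's actual tools or API. In particular, the abstract says the model is trained to decide when to use its tools, whereas this sketch substitutes a fixed keyword rule for that learned policy.

```python
from dataclasses import dataclass, field


@dataclass
class ContextState:
    """Hypothetical state container; not the paper's actual data structure."""
    budget: int                                    # max words kept in the working context
    context: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    def size(self) -> int:
        """Number of whitespace-separated words currently in the context."""
        return sum(len(chunk.split()) for chunk in self.context)

    def read(self, chunk: str) -> None:
        """Load one document chunk into the working context."""
        self.context.append(chunk)

    def take_note(self, note: str) -> None:
        """Note-taking: keep a fact outside the working context."""
        self.notes.append(note)

    def prune(self) -> None:
        """Context pruning: drop the oldest chunks until within budget."""
        while self.context and self.size() > self.budget:
            self.context.pop(0)


def build_index(chunks: list[str]) -> dict[str, list[int]]:
    """Document indexing: map each lowercased word to the chunk ids containing it."""
    index: dict[str, list[int]] = {}
    for i, chunk in enumerate(chunks):
        words = {w.strip(".,") for w in chunk.lower().split()}
        for word in words:
            index.setdefault(word, []).append(i)
    return index


def run_loop(chunks: list[str], keyword: str, budget: int) -> ContextState:
    """Fixed-rule stand-in for a learned policy: read, note, then prune."""
    state = ContextState(budget=budget)
    index = build_index(chunks)
    for i in index.get(keyword.lower(), []):
        state.read(chunks[i])       # pull a relevant chunk into context
        state.take_note(chunks[i])  # a trained model would choose what to record
        state.prune()               # stay within the context budget
    return state


chunks = [
    "The wand is in the tower.",
    "Dumbledore stores memories in a Pensieve.",
    "The wand chooses the wizard.",
]
state = run_loop(chunks, "wand", budget=8)
print(state.context)
print(state.notes)
```

With a budget of 8 words, the first matching chunk is pruned from the working context once the second is read, but both survive in the notes. That is the point of the sketch: information can outlive the window it was read in.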

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- This is a simple and novel idea: the model becomes active, inspects its current memory/state, and accordingly constructs the context to operate on using pre-defined tools.
- No pressing need for the user-in-the-middle role of building prompts conditioned on a manually inspected state (automation).
- Clean guidelines for training; the set of orthogonal tools is well-defined.
- Performance on long-context recall and QA benchmarks is impressive.

Weaknesses

- Critical requirement for the availability of a strong LLM for the generation of training samples (in particular for process-mode classification).
- The set of tools is fixed in advance; it is fairly generic, but it cannot cover every kind of question.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

● The problem addressed in this paper is crucial: transitioning from stateless LLMs to a stateful paradigm enables long-term reasoning, multi-turn dialogue memory, and cross-session continuity.
● The paper is well-written, and the case study in Section 3 provides an intuitive and effective way to illustrate the Pensieve paradigm.
● The “model as the wizard” framing, i.e., pushing the model toward fully autonomous decision-making about when (and, potentially in future work, how) to manage its own

Weaknesses

1. The paper aims at a meaningful goal of achieving a fully automated workflow, since heuristic and human-defined pipelines may not fully unlock the capability of LLMs. However, the framework still relies on manually defined tools, making it essentially semi-automated. Given that prior work (e.g., Memory-R1) also trains models to learn what memory operations to perform, the main difference here seems to lie in when those operations are triggered. Memory-R1 updates memory after each turn, which i

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The data curation pipeline is carefully designed. Multi-stage filtering and process-mode classification (search vs. scan) produce cleaner trajectories for training.
- SLM w/o search outperforms the baseline by a large margin, especially beyond 256K tokens.
- On real-world tasks like NovelQA and InfiniteBench, the results are impressive: SLM with a short context (32K) achieves better performance than the instruct model with a 128K-token context.
- Good writing. The description of StateLM an

Weaknesses

- It is not clear how StateLM materially differs from prior work (e.g., A-Mem, SCM, Dynamic Cheatsheet). The claim of “not a fixed workflow loop” does not really establish novelty, as this function is already supported by agentic toolkits like Anthropic’s Model Context Protocol (MCP) and has also been explored by prior work.
- The training trajectories come from Sonnet-4, which, along with many open-source agents, can already decide which tools to use given context. As presented, the contribution is

Code & Models

Datasets

lindsay21/longbench_v2_transformed_rl
dataset · 16 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare