MEMENTO: Teaching LLMs to Manage Their Own Context

Vasilis Kontonis; Yuchen Zeng; Shivam Garg; Lingjiao Chen; Hao Tang; Ziyan Wang; Ahmed Awadallah; Eric Horvitz; John Langford; Dimitris Papailiopoulos

arXiv:2604.09852·cs.AI·April 14, 2026

TL;DR

MEMENTO introduces a method for teaching large language models to segment reasoning, compress intermediate states into summaries, and attend only to these summaries, reducing context size and computational load.

Contribution

The paper presents a two-stage SFT recipe and a public dataset, OpenMementos, for teaching models to manage their reasoning context through segmentation and summarization, improving efficiency while maintaining accuracy.

Findings

01

Models trained with MEMENTO maintain strong accuracy on math, science, and coding benchmarks.

02

Trained models achieve an approximately 2.5x reduction in peak KV cache usage.

03

A vLLM extension supporting the inference method improves throughput by about 1.75x.

Abstract

Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B--32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving $\sim 2.5 \times$ peak KV cache reduction. We extend vLLM to support our inference method, achieving…
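
The block-then-memento idea in the abstract can be illustrated with a small token-accounting sketch. This is not the authors' implementation: the `Segment` type, the `peak_context` function, and all token counts are hypothetical, and the sketch assumes that a block stays in context while its memento is being written and is dropped afterwards. The numbers are made up and are not meant to reproduce the paper's reported reduction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    block_tokens: int    # tokens in one raw reasoning block (hypothetical count)
    memento_tokens: int  # tokens in the summary that stands in for it (hypothetical count)


def peak_context(prompt_tokens: int, segments: List[Segment], evict: bool) -> int:
    """Return the largest number of tokens held in context at any point."""
    kept = prompt_tokens  # tokens that remain visible to later segments
    peak = kept
    for seg in segments:
        # Assumption: the block and its memento are both in context
        # while the memento is being generated.
        live = kept + seg.block_tokens + seg.memento_tokens
        peak = max(peak, live)
        if evict:
            # Memento-style: drop the block, keep only its summary.
            kept += seg.memento_tokens
        else:
            # Baseline: everything generated so far stays in context.
            kept = live
    return peak


segments = [Segment(800, 100), Segment(1200, 150), Segment(600, 80)]
full = peak_context(200, segments, evict=False)
compact = peak_context(200, segments, evict=True)
print(full, compact)  # prints "3130 1650"
```

In this toy accounting the baseline peak grows with the total trace length, while the evicting variant peaks at the largest single block plus the mementos kept before it.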
