Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei; Noveen Sachdeva; Benjamin Coleman; Zhankui He; Yuanchen Bei; Xuying Ning; Mengting Ai; Yunzhe Li; Jingrui He; Ed H. Chi; Chi Wang; Shuo Chen; Fernando Pereira; Wang-Cheng Kang; Derek Zhiyuan Cheng

arXiv:2511.20857 · cs.CL · May 19, 2026

TL;DR

Evo-Memory introduces a new benchmark and framework for evaluating how large language model agents can dynamically learn, adapt, and improve their memory during continuous task streams, addressing a key gap in current static evaluation methods.

Contribution

The paper presents Evo-Memory, a comprehensive streaming benchmark for testing self-evolving memory in LLM agents, along with implementations of memory modules and a new method, ReMem, for continual learning.

Findings

1. Evaluated more than ten memory modules across ten diverse datasets.

2. Demonstrated the effectiveness of ReMem in continual improvement.

3. Provided insights into memory management for long-term LLM agent deployment.

Abstract

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet they often fail to learn from accumulated interactions and lose valuable contextual insights. This limitation calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and…
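
The retrieve, integrate, and update cycle described in the abstract can be illustrated with a small sketch. This is not ReMem and not the Evo-Memory API: the names EvolvingMemory, run_stream, and placeholder_solver are hypothetical, the token-overlap retrieval is an assumption chosen for simplicity, and the solver is a stand-in for an LLM call. It only shows the shape of a loop in which each task sees memory built from earlier tasks in the stream.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    task: str
    outcome: str


@dataclass
class EvolvingMemory:
    """Toy memory store; token-overlap retrieval is an illustrative assumption."""
    entries: list = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list:
        query_tokens = set(query.lower().split())
        scored = [
            (len(query_tokens & set(e.task.lower().split())), i, e)
            for i, e in enumerate(self.entries)
        ]
        # Keep entries sharing at least one token; highest overlap first, oldest first on ties.
        relevant = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [e for _, _, e in relevant[:k]]

    def update(self, task: str, outcome: str) -> None:
        self.entries.append(MemoryEntry(task, outcome))


def run_stream(tasks, solve, memory):
    """Process tasks in order: retrieve, act with the retrieved context, then update memory."""
    retrieved_counts = []
    for task in tasks:
        context = memory.retrieve(task)      # retrieve past experience
        outcome = solve(task, context)       # integrate it into the current attempt
        memory.update(task, outcome)         # store the new experience for later tasks
        retrieved_counts.append(len(context))
    return retrieved_counts


def placeholder_solver(task, context):
    # Hypothetical stand-in for an LLM call; it only reports how much context it received.
    return f"solved '{task}' using {len(context)} past experiences"


memory = EvolvingMemory()
stream = ["sort a list of numbers", "sort a list of names", "reverse a string"]
counts = run_stream(stream, placeholder_solver, memory)
print(counts)  # [0, 1, 2]
```

The first task sees an empty memory, and each later task can draw on every earlier one, which is the property a streaming evaluation measures and a static one cannot.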

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Multi-Agent Systems and Negotiation