MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He; Yu Wang; Churan Zhi; Yuanzhe Hu; Tzu-Ping Chen; Lang Yin; Ze Chen; Tong Arthur Wu; Siru Ouyang; Zihan Wang; Jiaxin Pei; Julian McAuley; Yejin Choi; Alex Pentland

arXiv:2602.16313·cs.CL·February 19, 2026

Open Access · 1 dataset

TL;DR

MemoryArena provides a comprehensive benchmark for evaluating how agents utilize long-term memory across interdependent, multi-session tasks, revealing gaps in current memory evaluation methods.

Contribution

The paper introduces MemoryArena, a unified benchmark for assessing agent memory in realistic multi-session tasks with interdependent subtasks, a setting that existing evaluation frameworks do not cover.

Findings

01. Agents perform poorly on MemoryArena compared to existing benchmarks.

02. Memory utilization is crucial for solving interdependent, multi-session tasks.

03. Current benchmarks do not adequately evaluate memory in realistic agent settings.

Abstract

Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into…
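The multi-session loop described in the abstract can be illustrated with a toy sketch. This is not the benchmark's code or API: the `Memory`, `Session`, and `run_sessions` names, the list-of-notes memory, and the three example sessions are all hypothetical, chosen only to show why interdependent subtasks fail when nothing is carried across sessions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Memory:
    """Hypothetical memory store: a plain list of distilled notes."""
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)

    def read(self) -> List[str]:
        return list(self.notes)


@dataclass
class Session:
    """Hypothetical subtask that may depend on a fact from an earlier session."""
    name: str
    required_fact: Optional[str]  # fact the agent must already remember, if any
    revealed_fact: str            # fact exposed by this session's feedback


def run_sessions(sessions: List[Session], memory: Optional[Memory]) -> List[bool]:
    """Run sessions in order; a session succeeds only if its required fact is remembered."""
    results = []
    for session in sessions:
        # Each session starts fresh: the only carry-over is the memory store.
        known = memory.read() if memory is not None else []
        success = session.required_fact is None or session.required_fact in known
        results.append(success)
        # In this toy, feedback is written to memory after every session,
        # whether or not the session succeeded.
        if memory is not None:
            memory.write(session.revealed_fact)
    return results


# Illustrative (made-up) chain of interdependent subtasks.
SESSIONS = [
    Session("pick supplier", None, "supplier=A"),
    Session("order parts", "supplier=A", "order_id=17"),
    Session("track shipment", "order_id=17", "eta=friday"),
]

print(run_sessions(SESSIONS, Memory()))  # [True, True, True]
print(run_sessions(SESSIONS, None))      # [True, False, False]
```

Without a memory store, only the first session succeeds, because each later session needs a fact that was revealed earlier. A real agent would replace the exact-match lookup with retrieval and reasoning over stored experience.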

Code & Models

Datasets

ZexueHe/memoryarena
dataset · 2.0k downloads

Taxonomy

Topics: AI-based Problem Solving and Planning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI