LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Di Wu; Zixiang Ji; Asmi Kawatkar; Bryan Kwan; Jia-Chen Gu; Nanyun Peng; Kai-Wei Chang

arXiv:2605.12493 · cs.CL · May 13, 2026

TL;DR

LongMemEval-V2 is a benchmark for evaluating long-term memory systems for web agents, focusing on whether those systems help an agent recall environment-specific experience.

Contribution

The paper presents a new benchmark, LongMemEval-V2 (LME-V2), and two memory methods. It reports accuracy gains over baselines and highlights open challenges in long-term agent memory.

Findings

01 AgentRunbook-C achieves 72.5% accuracy, outperforming baselines.

02 Memory methods improve question-answering accuracy in web environments.

03 Coding agent methods have high latency, indicating a trade-off between performance and efficiency.

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories…

Code & Models

Repositories

xiaowu0162/LongMemEval-V2
github

Datasets

xiaowu0162/longmemeval-v2
dataset · 885 downloads
