LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang

TL;DR
LongMemEval-V2 is a benchmark for evaluating long-term memory systems in web agents, focusing on whether they help an agent recall environment-specific experience.
Contribution
The paper introduces the LME-V2 benchmark and two memory methods, reports accuracy gains over baselines, and highlights open challenges in long-term agent memory.
Findings
AgentRunbook-C achieves 72.5% accuracy, outperforming baselines.
Memory methods improve question-answering accuracy in web environments.
Coding agent methods have high latency, indicating a trade-off between performance and efficiency.
Abstract
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories…
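As an illustration of how results on a benchmark organized around these five abilities might be tallied, here is a minimal Python sketch. The record type `GradedQuestion`, its fields, and the helper functions are hypothetical names introduced for this example, and binary per-question grading is an assumption; none of this is taken from the paper's released code.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Ability(Enum):
    # The five memory abilities named in the abstract.
    STATIC_STATE_RECALL = "static state recall"
    DYNAMIC_STATE_TRACKING = "dynamic state tracking"
    WORKFLOW_KNOWLEDGE = "workflow knowledge"
    ENVIRONMENT_GOTCHAS = "environment gotchas"
    PREMISE_AWARENESS = "premise awareness"


@dataclass
class GradedQuestion:
    # Hypothetical record: one benchmark question with a binary grade (assumed).
    question_id: str
    ability: Ability
    correct: bool


def accuracy_by_ability(graded):
    """Return a dict mapping each ability present in `graded` to its accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for g in graded:
        totals[g.ability] += 1
        hits[g.ability] += int(g.correct)
    return {ability: hits[ability] / totals[ability] for ability in totals}


def overall_accuracy(graded):
    """Return the fraction of questions graded correct (0.0 for an empty list)."""
    if not graded:
        return 0.0
    return sum(int(g.correct) for g in graded) / len(graded)
```

Reporting accuracy per ability alongside the overall figure makes it visible when a memory method does well on one kind of recall, such as static state, but poorly on another, such as premise awareness.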
