TL;DR
MEME is a benchmark of six tasks for evaluating multi-entity and evolving memory in LLM-based agents. It shows that current memory systems handle static retrieval adequately but collapse on dependency reasoning.
Contribution
The paper presents a new benchmark with six tasks for multi-entity and evolving memory evaluation, including three novel tasks, and provides an extensive analysis of existing memory systems' performance.
Findings
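All six memory systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy), despite adequate static retrieval performance.
Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap.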
A file-based agent with Claude Opus 4.7 partially closes the gap but at high cost.
Abstract
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap,…
