When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
Jiaqi Shao, Yiyi Lu, Yunzhen Zhang, and Bing Luo

TL;DR
This paper introduces a scale-conditioned evaluation protocol for agent memory that assesses how evidence usability degrades as irrelevant sessions accumulate, providing nuanced diagnostics across different memory interfaces.
Contribution
It presents a novel evaluation framework that measures memory reliability under evidence-preserving growth, revealing complex failure regimes and supporting scalable-memory claims.
Findings
Reliability loss varies with agent and interface.
On LongMemEval, HippoRAG stays within the two-call budget but loses reliability as irrelevant sessions accumulate.
Memory failures depend on agent size and interface rather than reflecting a single phenomenon.
Abstract
Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent-memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows that reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16-20 percentage…
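To make three of the four diagnostics concrete, here is a minimal Python sketch of how they might be computed from logged trajectories. It is not the authors' implementation. The `Trajectory` record, its field names, the nearest-rank 95th percentile used for the tail burden, and the reading of "budget-compliant" as "correct answer within the call budget" are all assumptions made for illustration. Only the two-call budget comes from the abstract. Failure-regime decomposition is omitted because the abstract does not define the regimes.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Trajectory:
    """One logged agent-memory run (hypothetical record layout)."""
    scale: int          # number of irrelevant sessions added for this query
    correct: bool       # whether the final answer was judged correct
    memory_calls: int   # number of memory calls the agent issued


def budget_compliant_reliability(trajs, budget=2):
    """Fraction of runs per scale that are correct AND within the call budget."""
    counts = {}
    for t in trajs:
        ok = t.correct and t.memory_calls <= budget
        hits, total = counts.get(t.scale, (0, 0))
        counts[t.scale] = (hits + int(ok), total + 1)
    return {s: h / n for s, (h, n) in sorted(counts.items())}


def tail_call_burden(trajs, q=0.95):
    """Nearest-rank q-quantile of memory calls per scale (assumed definition)."""
    by_scale = {}
    for t in trajs:
        by_scale.setdefault(t.scale, []).append(t.memory_calls)
    out = {}
    for s, calls in sorted(by_scale.items()):
        calls.sort()
        rank = max(1, ceil(q * len(calls)))  # 1-based nearest rank
        out[s] = calls[rank - 1]
    return out


def usable_scale_boundary(reliability, target):
    """Smallest scale whose reliability falls below the target, else None."""
    for s in sorted(reliability):
        if reliability[s] < target:
            return s
    return None
```

Under these assumptions, a run that answers correctly but needs three memory calls counts against reliability at the two-call budget, which is what separates budget-compliant reliability from plain accuracy.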
