When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

Jiaqi Shao, Yiyi Lu, Yunzhen Zhang, and Bing Luo

arXiv:2605.07313 · cs.AI · May 11, 2026

TL;DR

This paper introduces a scale-conditioned evaluation protocol for agent memory. It holds each query's task evidence fixed while adding irrelevant sessions, and measures how the usability of that evidence degrades across flat, planar, and hierarchical memory interfaces.

Contribution

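The protocol measures memory reliability under evidence-preserving growth and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Together these expose distinct failure regimes and give claims about scalable memory something concrete to rest on.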

Findings

01

Reliability loss varies with agent and interface.

02

On LongMemEval, HippoRAG stays within the two-call budget but loses reliability as irrelevant sessions accumulate.

03

Memory failure is not a single phenomenon; it depends on agent size and interface.

Abstract

Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent–memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16–20 percentage…
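To make three of the four diagnostics concrete, here is a minimal Python sketch of how they might be computed from logged trajectories. It is an illustration under stated assumptions, not the authors' implementation: the `Trajectory` record, its field names, the 0.95 tail quantile, and the "first scale below target" reading of the usable-scale boundary are all hypothetical choices. Failure-regime decomposition is left out because the abstract does not define the regimes.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass(frozen=True)
class Trajectory:
    """One logged agent-memory run (hypothetical record layout)."""
    scale: int         # number of irrelevant sessions added for this query
    correct: bool      # whether the final answer was judged correct
    memory_calls: int  # memory calls the agent issued


def budget_compliant_reliability(trajs: list[Trajectory], call_budget: int) -> float:
    """Fraction of runs that are correct AND stay within the call budget."""
    if not trajs:
        raise ValueError("need at least one trajectory")
    ok = sum(1 for t in trajs if t.correct and t.memory_calls <= call_budget)
    return ok / len(trajs)


def tail_call_burden(trajs: list[Trajectory], q: float = 0.95) -> int:
    """Nearest-rank q-quantile of memory calls (q = 0.95 is an assumption)."""
    if not trajs:
        raise ValueError("need at least one trajectory")
    calls = sorted(t.memory_calls for t in trajs)
    rank = max(1, ceil(q * len(calls)))
    return calls[rank - 1]


def usable_scale_boundary(
    trajs: list[Trajectory], call_budget: int, target: float
) -> Optional[int]:
    """Smallest scale whose reliability falls below target, or None if none does."""
    by_scale: dict[int, list[Trajectory]] = {}
    for t in trajs:
        by_scale.setdefault(t.scale, []).append(t)
    for scale in sorted(by_scale):
        if budget_compliant_reliability(by_scale[scale], call_budget) < target:
            return scale
    return None
```

Counting an over-budget run as a failure even when its answer is correct is what separates budget-compliant reliability from plain accuracy in this sketch; the paper may define the metric differently.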
