STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun

TL;DR
This paper introduces STALE, a benchmark for evaluating LLMs' ability to detect and adapt to invalidated memories caused by implicit conflicts, highlighting a significant gap in current models' capabilities.
Contribution
The paper presents STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries), and uses it to evaluate LLMs, revealing substantial weaknesses in memory revision and state awareness.
Findings
LLMs achieve only 55.2% accuracy in detecting outdated memories.
Models often accept outdated assumptions in user queries.
Explicit state adjudication improves memory revision capabilities.
Abstract
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a…
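To make the scenario structure concrete, here is a minimal sketch of how one conflict scenario and its per-dimension probes might be represented and scored. It is an illustration under stated assumptions, not the benchmark's actual schema: the class names, field names, dimension keys, and the example scenario are all hypothetical. Only the two probing dimensions named in the excerpt above are used; the third is left out because the abstract is truncated here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Probe:
    """One evaluation query attached to a scenario (hypothetical schema)."""
    dimension: str   # e.g. "state_resolution" or "premise_resistance"
    query: str
    correct: bool    # whether the model's answer was judged correct


@dataclass
class ConflictScenario:
    """An earlier memory plus a later observation that implicitly invalidates it."""
    earlier_memory: str
    later_observation: str
    probes: list[Probe] = field(default_factory=list)


def accuracy_by_dimension(scenarios):
    """Return the fraction of correct probes for each probing dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for scenario in scenarios:
        for probe in scenario.probes:
            totals[probe.dimension] += 1
            hits[probe.dimension] += int(probe.correct)
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Invented example, not taken from the benchmark. The later observation never
# says "I stopped cycling", so the conflict has to be inferred.
example = ConflictScenario(
    earlier_memory="User cycles to work every day.",
    later_observation="User mentions being in a leg cast for the next six weeks.",
    probes=[
        Probe("state_resolution", "How does the user currently commute?", correct=False),
        Probe("premise_resistance", "Suggest a scenic bike route for my commute tomorrow.", correct=True),
    ],
)

scores = accuracy_by_dimension([example])
```

With one probe per dimension per scenario, 400 scenarios across three dimensions give the 1,200 evaluation queries stated in the abstract.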
