MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios
Yihang Ding, Wanke Xia, Yiting Zhao, Jinbo Su, Jialiang Yang, Zhengbo Zhang, Ke Wang, Wenming Yang

TL;DR
MemGround introduces a comprehensive benchmark for evaluating long-term memory in large language models within gamified, interactive scenarios, addressing limitations of static evaluation methods.
Contribution
MemGround proposes a three-tier hierarchical framework and a multi-dimensional metric suite for assessing how LLMs use memory dynamically during complex interactions.
Findings
State-of-the-art LLMs struggle with sustained dynamic tracking.
Models have difficulty with temporal event association.
Complex reasoning from long-term evidence remains challenging.
Abstract
Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory…
