Evaluating Memory Structure in LLM Agents
Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

TL;DR
This paper introduces StructMemEval, a benchmark to evaluate how well LLM-based agents can organize and utilize complex long-term memory structures, revealing current limitations and guiding future improvements.
Contribution
It proposes a new benchmark for testing how LLM agents organize complex memory, addressing a gap left by existing benchmarks that focus on factual recall.
Findings
Simple retrieval-augmented LLMs struggle with structured memory tasks.
Memory agents can solve these tasks when explicitly prompted on how to organize their memory.
Modern LLMs often fail to recognize memory structures without explicit prompts.
Abstract
Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval, a benchmark that tests the agent's ability to organize its long-term memory rather than just recall facts. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees, and others. Our initial experiments show…
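To make the distinction between factual recall and memory organization concrete, here is a minimal sketch of the transaction-ledger case mentioned in the abstract. It is an illustration under stated assumptions, not the benchmark's actual task format or evaluation code: the `events` data, `update_ledger`, and `keyword_retrieve` are all hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical facts an agent might see over a long conversation.
# Each tuple means: payer covered `amount` on behalf of debtor.
# This data is invented for illustration and is not from the benchmark.
events = [
    ("alice", "bob", 30),
    ("bob", "carol", 10),
    ("carol", "alice", 5),
    ("alice", "bob", 20),
]


def keyword_retrieve(events, name):
    """Flat, retrieval-style memory: return every stored fact mentioning `name`.

    The caller still has to aggregate the returned facts to answer a
    question such as "what is this person's net balance?".
    """
    return [event for event in events if name in event[:2]]


def update_ledger(ledger, payer, debtor, amount):
    """Structured memory: fold one event into running net balances.

    A positive balance means the person is owed money overall;
    a negative balance means the person owes money overall.
    """
    ledger[payer] += amount
    ledger[debtor] -= amount


# Build the ledger incrementally, one event at a time.
ledger = defaultdict(int)
for payer, debtor, amount in events:
    update_ledger(ledger, payer, debtor, amount)

# Retrieval returns raw facts; the ledger returns the answer directly.
print(keyword_retrieve(events, "bob"))
print(dict(ledger))
```

In this sketch, the flat store can return every fact that mentions a person, but answering a net-balance question still requires combining all of them correctly, whereas the ledger keeps the aggregate up to date as each event arrives.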
Taxonomy
Topics: Topic Modeling · AI in Service Interactions · Personal Information Management and User Behavior
