StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns
Luanbo Wan, Weizhi Ma

TL;DR
StoryBench is a dynamic benchmark built on interactive fiction that evaluates long-term memory in large language models, emphasizing complex reasoning and decision tracing in evolving narratives.
Contribution
The paper presents a novel benchmark framework with a new dataset, addressing limitations of existing tests by focusing on hierarchical reasoning and memory recall in narrative environments.
Findings
The benchmark effectively assesses LLMs' long-term memory capabilities.
Models show varied performance across reasoning and revision tasks.
The framework provides a reliable tool for future LTM evaluations.
Abstract
Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a lack of standardized benchmarks to systematically evaluate LLMs' long-term memory abilities. Existing benchmarks still face challenges in evaluating knowledge retention and dynamic sequential reasoning, and in their own flexibility, all of which limit their effectiveness in assessing models' LTM capabilities. To address these gaps, we propose a novel benchmark framework based on interactive fiction games, featuring dynamically branching storylines with complex reasoning structures. These structures simulate real-world scenarios by requiring LLMs to navigate hierarchical decision trees, where each choice triggers cascading dependencies across multi-turn…
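The branching structure described in the abstract can be illustrated with a minimal sketch. It is not the StoryBench dataset or the authors' implementation. The three-scene story, the flag names, and the helpers `Choice`, `available_choices`, and `play` are all hypothetical, chosen only to show how an early choice can gate options several turns later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    label: str
    next_node: str
    sets: frozenset = frozenset()      # flags this choice records
    requires: frozenset = frozenset()  # flags an earlier choice must have set


# Hypothetical three-scene story; not taken from the StoryBench dataset.
STORY = {
    "gate": [
        Choice("bribe guard", "hall", sets=frozenset({"guard_bribed"})),
        Choice("sneak past", "hall"),
    ],
    "hall": [
        Choice("take key", "vault", sets=frozenset({"has_key"})),
        Choice("ignore key", "vault"),
    ],
    "vault": [
        Choice("unlock vault", "end_win", requires=frozenset({"has_key"})),
        Choice("call guard", "end_win", requires=frozenset({"guard_bribed"})),
        Choice("give up", "end_lose"),
    ],
}


def available_choices(node, flags):
    """Return the choices at `node` whose requirements are met by `flags`."""
    return [c for c in STORY.get(node, []) if c.requires <= flags]


def play(labels, start="gate"):
    """Follow a sequence of choice labels; return (final node, flags, trace)."""
    node, flags, trace = start, frozenset(), []
    for label in labels:
        options = {c.label: c for c in available_choices(node, flags)}
        if label not in options:
            raise ValueError(f"{label!r} is not available at {node!r}")
        choice = options[label]
        trace.append((node, label))   # decision trace a model would need to recall
        flags = flags | choice.sets   # earlier choices persist into later turns
        node = choice.next_node
    return node, flags, trace
```

In this sketch, whether "unlock vault" is available in the third turn depends on a decision made in the second, so choosing correctly requires remembering the earlier turn rather than reading only the current scene.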
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling
