StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

Luanbo Wan; Weizhi Ma

arXiv:2506.13356 · cs.CL · June 17, 2025

Open Access

TL;DR

StoryBench introduces a dynamic, interactive fiction-based benchmark to evaluate long-term memory in large language models, emphasizing complex reasoning and decision tracing in evolving narratives.

Contribution

It presents a novel benchmark framework with a new dataset, addressing limitations of existing tests by focusing on hierarchical reasoning and memory recall in narrative environments.

Findings

01. Benchmark effectively assesses LLMs' long-term memory capabilities.

02. Models show varied performance across reasoning and revision tasks.

03. The framework provides a reliable tool for future LTM evaluations.

Abstract

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a lack of standardized benchmarks to systematically evaluate LLMs' long-term memory abilities. Existing benchmarks still face challenges in evaluating knowledge retention and dynamic sequential reasoning, and in their own flexibility, all of which limit their effectiveness in assessing models' LTM capabilities. To address these gaps, we propose a novel benchmark framework based on interactive fiction games, featuring dynamically branching storylines with complex reasoning structures. These structures simulate real-world scenarios by requiring LLMs to navigate hierarchical decision trees, where each choice triggers cascading dependencies across multi-turn…
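
As a rough illustration of the branching structure the abstract describes, the sketch below models a tiny storyline in which an early choice gates a later one. It is not taken from the StoryBench dataset or code: the `Choice` and `StoryNode` classes, the flag mechanism, the `play` helper, and the three-scene story are all hypothetical and only show how a choice can create a dependency that matters several turns later.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Choice:
    label: str
    next_node: str
    sets_flag: Optional[str] = None      # fact recorded when this choice is taken
    requires_flag: Optional[str] = None  # earlier fact needed for this choice


@dataclass
class StoryNode:
    text: str
    choices: List[Choice] = field(default_factory=list)


# Hypothetical three-scene story, invented for this example.
STORY = {
    "start": StoryNode("You find a locked door and a key on the floor.", [
        Choice("take key", "hall", sets_flag="has_key"),
        Choice("ignore key", "hall"),
    ]),
    "hall": StoryNode("The door blocks the hallway.", [
        Choice("unlock door", "exit", requires_flag="has_key"),
        Choice("turn back", "start"),
    ]),
    "exit": StoryNode("You escape."),
}


def available(node, flags):
    """Return the choices whose dependency (if any) is already satisfied."""
    return [c for c in node.choices
            if c.requires_flag is None or c.requires_flag in flags]


def play(story, start, picks):
    """Follow a sequence of choice labels; return (final node id, flags, history)."""
    node_id, flags, history = start, set(), []
    for label in picks:
        options = {c.label: c for c in available(story[node_id], flags)}
        if label not in options:
            raise ValueError(f"choice {label!r} is not available at {node_id!r}")
        choice = options[label]
        if choice.sets_flag:
            flags.add(choice.sets_flag)
        history.append((node_id, label))
        node_id = choice.next_node
    return node_id, flags, history
```

In this toy setup, `play(STORY, "start", ["take key", "unlock door"])` reaches `"exit"`, while skipping the key makes "unlock door" unavailable. A memory probe in this style could ask a model, several turns later, which earlier decision made the door openable; the recorded `history` would supply the reference answer.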

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling