NARRA-Gym for Evaluating Interactive Narrative Agents

Yue Huang; Yuchen Ma; Jiayi Ye; Wenjie Wang; Zipeng Ling; Xingjian Hu; Yuexing Hao; Zichen Chen; Zhangchen Xu; Yunhong He; Zhengqing Yuan; Yujun Zhou; Kehan Guo; Chaoran Chen; Toby Jia-Jun Li; Stefan Feuerriegel; Xiangliang Zhang

arXiv:2605.08503·cs.CL·May 12, 2026

TL;DR

NARRA-Gym provides a comprehensive, executable benchmark environment for evaluating interactive narrative agents' ability to generate coherent, personalized stories while managing long-term context and user adaptation.

Contribution

The paper introduces NARRA-Gym, a novel environment for evaluating LLMs in interactive storytelling, with detailed trajectory logging and multi-faceted assessment.

Findings

01

Model performance varies significantly across evaluation dimensions.

02

Models fluent in story generation often struggle with robustness and personalization.

03

Interactive narrative serves as a valuable benchmark for long-horizon, user-adaptive LLM evaluation.

Abstract

Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which…
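
The trajectory logging described in the abstract can be pictured with a small data structure. The following is only an illustrative sketch, not NARRA-Gym's actual code or API: the names `EventKind`, `Event`, `Trajectory`, and `demo_episode` are hypothetical, and all payload values are placeholders. The five event kinds mirror the stages the abstract lists (story construction, memory updates, planning, pacing interventions, optional artifact synthesis).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class EventKind(Enum):
    """Stages named in the abstract; the identifiers themselves are hypothetical."""
    STORY_CONSTRUCTION = "story_construction"
    MEMORY_UPDATE = "memory_update"
    PLANNING = "planning"
    PACING_INTERVENTION = "pacing_intervention"
    ARTIFACT_SYNTHESIS = "artifact_synthesis"  # optional per the abstract


@dataclass
class Event:
    turn: int                 # turn index within the episode
    kind: EventKind           # which stage produced this record
    payload: dict[str, Any]   # free-form details (assumed shape)


@dataclass
class Trajectory:
    seed: str                 # the sparse emotional seed that starts an episode
    persona: str              # the user persona the episode is run against
    events: list[Event] = field(default_factory=list)

    def log(self, turn: int, kind: EventKind, **payload: Any) -> None:
        """Append one record to the episode log, preserving call order."""
        self.events.append(Event(turn, kind, payload))

    def count_by_kind(self) -> dict[EventKind, int]:
        """Count logged events per stage, including stages that never fired."""
        counts = {kind: 0 for kind in EventKind}
        for event in self.events:
            counts[event.kind] += 1
        return counts


def demo_episode() -> Trajectory:
    """Build a tiny placeholder episode; none of the values come from the paper."""
    traj = Trajectory(seed="placeholder seed", persona="placeholder persona")
    traj.log(0, EventKind.STORY_CONSTRUCTION, outline="placeholder outline")
    traj.log(1, EventKind.PLANNING, next_beat="placeholder beat")
    traj.log(1, EventKind.MEMORY_UPDATE, note="placeholder note")
    traj.log(2, EventKind.PACING_INTERVENTION, action="placeholder action")
    return traj
```

Keeping every stage in one ordered log, rather than storing only the final story text, is what would let an evaluator inspect how memory, planning, and pacing decisions evolved over turns instead of rating the output post hoc.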
