RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines

Pengfei Yu; Dongming Shen; Silin Meng; Jaewon Lee; Weisu Yin; Andrea; Yaoyun Cui; Zhenlin Xu; Yi Zhu; Xingjian Shi; Mu Li; Alex Smola

arXiv:2502.00595 · cs.CL · February 4, 2025

TL;DR

RPGBench is a benchmark that evaluates how well large language models can create and simulate text-based role-playing games, assessing creativity, coherence, and rule adherence through both objective and subjective measures.

Contribution

This work is the first to systematically evaluate LLMs as RPG engines, using two structured tasks (Game Creation and Game Simulation) and a combination of objective checks and LLM-as-a-judge scoring.

Findings

01. State-of-the-art LLMs generate engaging stories but struggle with consistent game mechanics.

02. Objective assessments verify rule adherence and state updates.

03. LLM-based judges evaluate content quality and role-playing depth.

Abstract

We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must craft a valid and playable RPG world using a structured event-state representation, ensuring logical coherence and proper termination conditions. In GS, the LLM simulates interactive gameplay across multiple rounds while consistently updating states and enforcing game rules. To comprehensively assess performance, RPGBench integrates objective and subjective evaluation methodologies. Objective measures verify adherence to event mechanics and check variable updates without requiring human intervention. Subjective measures, such as content interestingness, action quality, and role-playing capability, are evaluated via an LLM-as-a-judge framework, where a…
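To make the objective side of the evaluation concrete, here is a minimal sketch of what an event-state representation and an automatic state-update check could look like. It is an illustration only: the `Event` class, the `apply_event` and `check_round` helpers, and the `hp`/`gold` variables are all hypothetical and are not taken from the paper, which may define events, conditions, and termination quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical game state: variable name -> integer value.
State = Dict[str, int]


@dataclass
class Event:
    """Illustrative event: a precondition plus additive effects on state."""
    name: str
    condition: Callable[[State], bool]  # must hold for the event to fire
    effects: Dict[str, int]             # variable -> delta applied on firing
    terminal: bool = False              # True if the event ends the game


def apply_event(state: State, event: Event) -> State:
    """Return the state implied by the event's mechanics (input is not mutated)."""
    if not event.condition(state):
        raise ValueError(f"event {event.name!r} is not allowed in this state")
    new_state = dict(state)
    for var, delta in event.effects.items():
        new_state[var] = new_state.get(var, 0) + delta
    return new_state


def check_round(prev_state: State, event: Event, reported_state: State) -> bool:
    """Objective check: does the model-reported state match the expected one?"""
    if not event.condition(prev_state):
        return False  # the model fired an event whose precondition fails
    return apply_event(prev_state, event) == reported_state


# Assumed example values, for illustration only.
state = {"hp": 10, "gold": 0}
loot = Event("open_chest", lambda s: s["hp"] > 0, {"gold": 5})
trap = Event("spring_trap", lambda s: s["hp"] > 0, {"hp": -10}, terminal=True)

correct = check_round(state, loot, {"hp": 10, "gold": 5})    # state updated per the rules
incorrect = check_round(state, loot, {"hp": 10, "gold": 3})  # state drifted from the rules
```

A check of this kind needs no human in the loop, which is the property the abstract attributes to the objective measures; the subjective measures (interestingness, action quality, role-playing capability) are not captured by such a comparison and are left to the LLM judge.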

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques