RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

Hao Xiang; Tianyi Tang; Yang Su; Bowen Yu; An Yang; Fei Huang; Yichang Zhang; Yaojie Lu; Hongyu Lin; Xianpei Han; Jingren Zhou; Junyang Lin; Le Sun

arXiv:2507.20352 · cs.CL · October 24, 2025

TL;DR

RMTBench is a new bilingual, user-centric role-playing benchmark for LLMs that evaluates multi-turn dialogues based on user motivations, aiming to better reflect real-world applications and improve assessment accuracy.

Contribution

The paper introduces a comprehensive, user-focused role-playing benchmark with diverse characters and dialogue simulation, addressing a limitation of earlier character-centric evaluations, which reduce user-character interaction to isolated Q&A tasks.

Findings

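01. The benchmark includes 80 characters and over 8,000 dialogue rounds.

02. Dialogues are constructed from explicit user motivations rather than character descriptions, which keeps the evaluation aligned with practical user applications.

03. The benchmark provides a multi-turn dialogue simulation mechanism for realistic testing.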

Abstract

Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a character-centric approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we…

Code & Models

Datasets

xiangh/RMTBENCH
dataset· 10 dl

Videos

RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education