CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds
Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing, Xie, Ji-Rong Wen

TL;DR
This paper introduces CharacterBox, a simulation environment for evaluating LLMs' role-playing abilities through detailed character behavior trajectories, addressing limitations of existing evaluation methods.
Contribution
We propose CharacterBox, a novel sandbox for in-depth role-playing evaluation, and develop two fine-tuned models that perform competitively with GPT APIs.
Findings
CharacterBox enables comprehensive role-playing assessment.
Fine-tuned models CharacterNR and CharacterRM perform well.
Trajectory-based methods improve LLM evaluation accuracy.
Abstract
Role-playing is a crucial capability of Large Language Models (LLMs), enabling a wide range of practical applications, including intelligent non-player characters, digital twins, and emotional companions. Evaluating this capability in LLMs is challenging due to the complex dynamics involved in role-playing, such as maintaining character fidelity throughout a storyline and navigating open-ended narratives without a definitive ground truth. Current evaluation methods, which primarily focus on question-answering or conversational snapshots, fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing. In this paper, we propose CharacterBox, which is a simulation sandbox designed to generate situational fine-grained character behavior trajectories. These behavior trajectories enable a more comprehensive and in-depth evaluation of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsWikis in Education and Collaboration · Semantic Web and Ontologies
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Dropout · Attention Dropout · Softmax · Cosine Annealing · Byte Pair Encoding · Linear Layer · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning
