PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?
Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, Dequan Wang

TL;DR
PersonaEval is a benchmark designed to assess whether large language models can accurately identify speaking roles in dialogues, revealing current models' limitations compared to human performance in role attribution tasks.
Contribution
This work introduces PersonaEval, the first benchmark for testing LLMs' ability to recognize speaking roles, highlighting their current inadequacies for role-play evaluation.
Findings
Even the best-performing LLMs reach only about 69% accuracy in role identification
Humans reach about 91% accuracy on the same task
Reliable evaluation depends on strong, human-like reasoning in models
Abstract
Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy,…
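The role identification task described above can be scored as plain accuracy: for each dialogue, the evaluator picks one persona from a set of candidates, and the pick is compared with the true speaker. The following is a minimal sketch under stated assumptions, not the benchmark's actual code or data format. The names `RoleIdExample`, `role_id_accuracy`, and `first_candidate_baseline`, the field layout, and the toy dialogues are all hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RoleIdExample:
    """One hypothetical role-identification item (format assumed, not from the paper)."""
    context: str                 # dialogue context shown to the evaluator
    candidates: tuple[str, ...]  # personas the evaluator may choose from
    gold: str                    # the persona who is actually speaking


def role_id_accuracy(
    examples: Sequence[RoleIdExample],
    predict: Callable[[RoleIdExample], str],
) -> float:
    """Fraction of examples where the predicted persona matches the gold persona."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for ex in examples if predict(ex) == ex.gold)
    return correct / len(examples)


def first_candidate_baseline(ex: RoleIdExample) -> str:
    """Trivial stand-in for an LLM evaluator: always pick the first candidate."""
    return ex.candidates[0]


# Toy items invented for this sketch; they are not drawn from PersonaEval.
toy_examples = [
    RoleIdExample(
        context="A: You were late again. B: The train was delayed, Inspector.",
        candidates=("Inspector", "Clerk"),
        gold="Clerk",
    ),
    RoleIdExample(
        context="A: Pass the scalpel. B: Yes, Doctor.",
        candidates=("Nurse", "Doctor"),
        gold="Nurse",
    ),
]

toy_accuracy = role_id_accuracy(toy_examples, first_candidate_baseline)
```

In practice, `predict` would wrap a call to the LLM evaluator under test or record a human annotator's choice, so the same function can compare model and human accuracy on identical items.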
