PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?

Lingfeng Zhou; Jialing Zhang; Jin Gao; Mohan Jiang; Dequan Wang

arXiv:2508.10014·cs.CL·August 15, 2025

TL;DR

PersonaEval is a benchmark that tests whether large language models can correctly identify who is speaking in a dialogue. It shows that current models fall well short of human performance on this role-attribution task.

Contribution

This work introduces PersonaEval, the first benchmark that tests whether LLMs can recognize who is speaking in a dialogue, and shows that current models are not yet reliable enough to serve as role-play evaluators.

Findings

01  Even the best-performing LLMs reach only around 69% accuracy in role identification.
02  Humans reach close to 91% accuracy on the same task.
03  Reliable role-play evaluation depends on models having strong, human-like reasoning.

Abstract

Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy,…
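As a rough illustration of how accuracy on a role-identification task like this could be scored, the sketch below treats each item as a multiple-choice question: a dialogue context, a list of candidate personas, and one gold answer. The `RoleIdItem` fields, the `accuracy` helper, the trivial baseline, and the toy items are all hypothetical and are not taken from the PersonaEval dataset or the authors' code; a real evaluation would replace the baseline with a call to the model being judged.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RoleIdItem:
    """One hypothetical role-identification item (field names are assumptions)."""
    context: str               # dialogue excerpt in which the speaker must be identified
    candidates: Sequence[str]  # candidate personas to choose from
    answer: str                # gold persona for this excerpt


# A predictor maps (context, candidates) to one of the candidates.
Predictor = Callable[[str, Sequence[str]], str]


def accuracy(items: Sequence[RoleIdItem], predict: Predictor) -> float:
    """Fraction of items for which the predicted persona matches the gold persona."""
    if not items:
        raise ValueError("need at least one item to compute accuracy")
    correct = sum(predict(item.context, item.candidates) == item.answer for item in items)
    return correct / len(items)


def first_candidate_baseline(context: str, candidates: Sequence[str]) -> str:
    """Trivial stand-in for an LLM evaluator: always picks the first candidate."""
    return candidates[0]


# Toy items invented for illustration only; they are not PersonaEval data.
items = [
    RoleIdItem("[?]: I told you the train leaves at noon.", ["Alice", "Bob", "Carol"], "Alice"),
    RoleIdItem("[?]: Then we had better hurry.", ["Alice", "Bob", "Carol"], "Bob"),
    RoleIdItem("[?]: I have the tickets right here.", ["Bob", "Carol", "Alice"], "Bob"),
    RoleIdItem("[?]: Platform three, I think.", ["Carol", "Alice", "Bob"], "Bob"),
]

score = accuracy(items, first_candidate_baseline)
print(f"accuracy: {score:.2f}")
```

Under this framing, the reported LLM and human figures would be the same statistic computed with different predictors over the same items, which is what makes the two numbers directly comparable.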

Code & Models

Datasets

lingfengzhou/PersonaEval
dataset · 213 downloads
