RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani

TL;DR
RPA-Check is a comprehensive multi-stage framework for objectively evaluating LLM-based role-playing agents, addressing limitations of traditional NLP metrics in complex, constraint-heavy environments.
Contribution
It introduces a novel, structured evaluation pipeline combining qualitative criteria, granular indicators, semantic filtering, and LLM-based judgment, validated on legal role-playing scenarios.
Findings
Smaller, instruction-tuned models outperform larger ones in procedural consistency.
The framework effectively identifies trade-offs between model size, reasoning depth, and stability.
RPA-Check offers a standardized metric for generative agent evaluation.
Abstract
The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which…
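The four-step pipeline in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Indicator` class, the function names, the example questions, and the per-dimension pass-rate aggregation are all assumptions made for illustration. In particular, the filtering step below uses normalized exact matching as a stand-in for real semantic filtering, and the judge is a keyword stub where an actual LLM call would go.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Indicator:
    """A boolean checklist item (hypothetical structure, not from the paper)."""
    dimension: str   # step 1: the high-level criterion this item belongs to
    question: str    # step 2: a yes/no question derived from that criterion
    agents: tuple    # names of the agents the question refers to


def filter_indicators(indicators, target_agent):
    """Step 3 stand-in: drop duplicates and items that involve other agents.

    Real semantic filtering would compare meaning; here two questions count
    as duplicates only if they match after lowercasing and whitespace cleanup.
    """
    seen = set()
    kept = []
    for ind in indicators:
        if ind.agents != (target_agent,):
            continue  # agent isolation: the item must concern only the target
        key = " ".join(ind.question.lower().split())
        if key in seen:
            continue  # non-redundancy
        seen.add(key)
        kept.append(ind)
    return kept


def evaluate(transcript: str, indicators, judge: Callable[[str, str], bool]):
    """Step 4 stand-in: ask the judge each question, average per dimension."""
    answers = {}
    for ind in indicators:
        answers.setdefault(ind.dimension, []).append(bool(judge(transcript, ind.question)))
    return {dim: sum(vals) / len(vals) for dim, vals in answers.items()}


# Steps 1-2: dimensions and their expanded indicators, written by hand here.
candidates = [
    Indicator("role adherence", "Does the judge address the court formally?", ("judge",)),
    Indicator("role adherence", "does the judge address the court  formally?", ("judge",)),
    Indicator("role adherence", "Does the judge respond to the lawyer's objection?", ("judge", "lawyer")),
    Indicator("logical consistency", "Does the judge avoid contradicting an earlier ruling?", ("judge",)),
]


def toy_judge(transcript: str, question: str) -> bool:
    """Placeholder for an LLM judge: answers 'yes' only to the formality item."""
    return "formally" in question.lower()


kept = filter_indicators(candidates, target_agent="judge")
scores = evaluate("(dialogue transcript goes here)", kept, toy_judge)
print(len(kept), scores)
```

Keeping the judge as a plain callable means the same checklist can be scored by different judge models without touching the filtering or aggregation code.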
