RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

Riccardo Rosati; Edoardo Colucci; Massimiliano Bolognini; Adriano Mancini; Paolo Sernani

arXiv:2604.11655·cs.CL·April 14, 2026

TL;DR

RPA-Check is a comprehensive multi-stage framework for objectively evaluating LLM-based role-playing agents, addressing limitations of traditional NLP metrics in complex, constraint-heavy environments.

Contribution

It introduces a novel, structured evaluation pipeline combining qualitative criteria, granular indicators, semantic filtering, and LLM-based judgment, validated on legal role-playing scenarios.

Findings

01. Smaller, instruction-tuned models outperform larger ones in procedural consistency.

02. The framework effectively identifies trade-offs between model size, reasoning depth, and stability.

03. RPA-Check offers a standardized metric for generative agent evaluation.

Abstract

The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, which establishes high-level qualitative behavioral criteria; (2) Augmentation, which expands these criteria into granular boolean checklist indicators; (3) Semantic Filtering, which ensures indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which…
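To make the four-step pipeline in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Indicator` type, the token-overlap `jaccard` filter, the 0.8 threshold, the `toy_judge` keyword rule, and the per-dimension pass-rate aggregation are all assumptions made for illustration. The abstract is truncated before it describes how the judge's outputs are used, so the scoring step in particular is a guess. Only the redundancy part of Semantic Filtering is modeled; objectivity and agent-isolation checks are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Indicator:
    dimension: str  # high-level criterion this check belongs to (step 1)
    question: str   # boolean checklist question derived from it (step 2)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def filter_indicators(indicators: List[Indicator],
                      threshold: float = 0.8) -> List[Indicator]:
    """Step 3 (assumed form): drop indicators that nearly duplicate one already kept."""
    kept: List[Indicator] = []
    for ind in indicators:
        if all(jaccard(ind.question, k.question) < threshold for k in kept):
            kept.append(ind)
    return kept


def evaluate(transcript: str, indicators: List[Indicator],
             judge: Callable[[str, str], bool]) -> Dict[str, float]:
    """Step 4 (assumed aggregation): share of indicators judged True, per dimension."""
    results: Dict[str, List[bool]] = {}
    for ind in indicators:
        results.setdefault(ind.dimension, []).append(judge(transcript, ind.question))
    return {dim: sum(votes) / len(votes) for dim, votes in results.items()}


def toy_judge(transcript: str, question: str) -> bool:
    """Hypothetical keyword rule standing in for a real LLM judge call."""
    keyword = question.rstrip("?").split()[-1].lower()
    return keyword in transcript.lower()


indicators = filter_indicators([
    Indicator("role adherence", "Does the agent mention the court?"),
    Indicator("role adherence", "Does the agent mention the court?"),  # exact duplicate
    Indicator("role adherence", "Does the agent address the judge?"),
    Indicator("consistency", "Does the agent cite the statute?"),
])
scores = evaluate("Your Honor, the court should note the statute.",
                  indicators, toy_judge)
print(scores)
```

In a real setting, `judge` would wrap a call to an LLM prompted with the transcript and one indicator at a time, and the similarity function would likely be embedding-based; both are swapped for simple stand-ins here so the sketch runs without external dependencies.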
