PingPong: A Benchmark for Role-Playing Language Models with User   Emulation and Multi-Model Evaluation

Ilya Gusev

arXiv:2409.06820·cs.CL·April 10, 2025·2 cites

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Ilya Gusev

PDF

Open Access 1 Repo

TL;DR

PingPong introduces a comprehensive benchmark for assessing role-playing language models through user emulation and multi-model evaluation, enabling dynamic, multi-turn conversational assessment in multiple languages.

Contribution

This work presents a novel benchmark framework that uses multiple models to simulate users and evaluate dialogue quality, advancing the assessment of role-playing capabilities in language models.

Findings

01

Strong correlation between automated and human evaluations

02

Effective multi-model setup for diverse role-playing scenarios

03

Validated across English and Russian language models

Abstract

We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues. Our methodology involves three main components: a player model that adopts a specific character role, an interrogator model that simulates user behavior in a specific situation, and a judge model ensemble that evaluates conversation quality with 3 metrics: character consistency, entertainment value, and language fluency. We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations with 8 characters and 8 situations. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

ilyagusev/ping_pong_bench
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Speech and dialogue systems · Context-Aware Activity Recognition Systems