SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents

Changhao Jiang; Jiajun Sun; Yifei Cao; Jiabao Zhuang; Xinmeng Che; Hui Li; Xiaoran Fan; Ming Zhang; Junjie Ye; Shihan Dou; Zhiheng Xi; Jingqi Tong; Yilong Wu; Baoyu Fan; Tao Ji; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2508.02013 · cs.CL · March 27, 2026

TL;DR

SpeechRole introduces a large-scale dataset and benchmark for evaluating speech role-playing agents, highlighting current strengths and limitations in speech expressiveness, prosody, and role fidelity.

Contribution

The paper presents SpeechRole, a dataset (SpeechRole-Data) and evaluation benchmark (SpeechRole-Eval) built specifically for speech role-playing agents, an area that prior role-playing work, largely centered on text, has left without systematic evaluation.

Findings

01. End-to-end SRPAs like GPT-4o Audio show high fluency and naturalness.

02. Open-source models lag in prosody and emotion accuracy.

03. System performance is heavily influenced by underlying language models.

Abstract

Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation. We introduce SpeechRole, a unified framework for developing and assessing SRPAs. SpeechRole-Data contains 98 roles and 111k speech-to-speech conversations with rich timbre and prosodic variation, providing large-scale resources for training SRPAs. SpeechRole-Eval offers a multidimensional benchmark that directly evaluates generated speech, preserving paralinguistic cues and measuring interaction ability, speech expressiveness, and role-playing fidelity. Experiments show that end-to-end SRPAs such as GPT-4o Audio achieve strong fluency and naturalness, but remain limited in prosody consistency and emotion appropriateness. In contrast, current open-source end-to-end models exhibit…
