Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Jiatong Shi; Jionghao Han; Yichen Lu; Santiago Pascual; Pengfei Wu; Chenye Cui; Shinji Watanabe; Chao Weng; Cong Zhou

arXiv:2511.01261·cs.SD·November 4, 2025

Open Access · 4 Reviews

TL;DR

Speech-DRAME is a framework for evaluating speech role-play models. It combines a human-annotated evaluation benchmark (Speech-DRAME-EvalBench), a fine-tuned evaluation model (DRAME-Eval), and a speech role-play benchmark (Speech-DRAME-RoleBench) so that assessments track human judgments more closely and reflect real-world roles.

Contribution

The paper presents Speech-DRAME, a unified framework comprising two new benchmarks and a fine-tuned evaluation model, DRAME-Eval, that outperforms zero-shot and few-shot ALLM judges in assessing speech role-play quality.

Findings

01. DRAME-Eval outperforms zero-shot ALLMs in correlation with human ratings.

02. Speech-DRAME provides the first comprehensive, reproducible foundation for speech role-play evaluation.

03. Benchmark resources and evaluation strategies improve assessment of nuanced speech qualities.

Abstract

Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. **Novel Framework**: The proposed framework offers an interesting approach by integrating role-play generation with a dual evaluation strategy. The introduction of EvalBench and RoleBench datasets provides a clear framework for the evaluation of speech-based role-playing tasks.
2. **Detailed Benchmark Design**: The inclusion of both Archetype and Realism evaluation strategies ensures that the proposed method addresses both large-scale and fine-grained human perception of speech quality.
3. **

Weaknesses

1. **Over-reliance on the Appendix**: A significant amount of important information is placed in the **Appendix**, which makes it difficult to follow the main arguments and understand the contributions in the body of the paper. A well-written paper should be **self-contained**, with all critical information included in the main text.
2. **Clarity of Motivation**: The paper lacks a clear motivation regarding the limitation of **zero-shot ALLMs** as evaluation judges. There is insufficient discuss

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper proposes two complementary evaluation strategies: Archetype Evaluation (synthetic data-based) and Realism Evaluation (real human speech-based).
2. It builds a comprehensive framework, including datasets, evaluation models, and benchmarks, and evaluates multiple proprietary and open-source models.
3. The fine-tuned Qwen2Audio model achieves superior evaluation quality compared to general-purpose Audio LLMs.

Weaknesses

1. Realism-based evaluation suffers from domain mismatch, as its training data also contains synthetic speech, reducing its usability despite being based on real human speech.
2. Realism Evaluation shows poor alignment with human perception (Spearman correlation of only 0.375), indicating limited reliability.
3. The framework only supports single-turn evaluations, failing to capture the coherence of multi-turn narratives.
4. The paper lacks audible demo cases for the speech role-play generation
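
For readers unfamiliar with the metric cited in this review, the following is a minimal sketch of how a Spearman rank correlation between human ratings and an automatic judge's scores could be computed. It is not the paper's evaluation code: the function names and the toy data are assumptions made purely for illustration, and the numbers are unrelated to the 0.375 figure reported above.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical toy data: five clips rated by humans and by an automatic judge.
human_ratings = [1, 2, 3, 4, 5]
judge_scores = [2.0, 1.0, 3.5, 3.0, 4.8]
rho = spearman(human_ratings, judge_scores)
print(round(rho, 3))  # 0.8
```

Because the measure depends only on rank order, a judge can score on a different scale from the human annotators and still correlate well, provided it orders the samples similarly.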

Reviewer 03 · Rating 0 · Confidence 5

Strengths

The paper is desk-rejected since at the time of submission, the main text exceeds 9 pages.

Weaknesses

The paper is desk-rejected since at the time of submission, the main text exceeds 9 pages.

Reviewer 04 · Rating 0 · Confidence 3

Strengths

Exceeded the page limit. Desk Reject.

Weaknesses

Exceeded the page limit. Desk Reject.

Taxonomy

Topics: Social Robot Interaction and HRI · Emotion and Mood Recognition · Multimodal Machine Learning Applications