URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models

Ruiqi Yan; Xiquan Li; Wenxi Chen; Zhikang Niu; Chen Yang; Ziyang Ma; Kai Yu; Xie Chen

arXiv:2502.17810 · cs.CL · August 12, 2025

TL;DR

URO-Bench is a comprehensive evaluation benchmark for end-to-end spoken dialogue models, covering multilingualism, multi-round dialogues, and paralinguistics to advance speech-to-speech AI research.

Contribution

This paper introduces URO-Bench, the first speech-to-speech (S2S) benchmark to evaluate spoken dialogue models (SDMs) on multilingualism, multi-round dialogues, and paralinguistics, filling a significant evaluation gap.

Findings

01. Current SDMs perform well in daily QA tasks.

02. They lag behind LLMs in instruction-following and suffer from catastrophic forgetting.

03. Performance in paralinguistic and audio understanding is subpar.

Abstract

Recent advances in large language models (LLMs) have driven significant progress in end-to-end spoken dialogue models (SDMs). In contrast to text-based LLMs, the evaluation framework for SDMs should encompass both cognitive dimensions (e.g., logical reasoning, knowledge) and speech-related aspects (e.g., paralinguistic cues, audio quality). However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels, a basic track and a pro track, each comprising 20 test sets that evaluate the spoken dialogue model's abilities in Understanding, Reasoning, and Oral conversation. Evaluations on our…

Code & Models

Models

tutu0604/UltraVoice-SFT
model · 6 dl

Datasets

tutu0604/UltraVoice
dataset · 286 dl

Videos

URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications