VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Heyang Liu; Yuhao Wang; Ziyang Cheng; Hongcheng Liu; Yiqi Li; Yixuan Hou; Ronghua Wu; Qunshan Gu; Yanfeng Wang; Yu Wang

arXiv:2505.15727 · cs.CL · January 14, 2026

TL;DR

VocalBench provides a comprehensive benchmark for evaluating speech conversational abilities of models across multiple dimensions, addressing gaps in real-world scenario testing and enabling better comparison of current speech interaction systems.

Contribution

This paper introduces VocalBench, a benchmark of roughly 24,000 instances in English and Mandarin that assesses speech interaction models along four dimensions: semantic quality, acoustic performance, conversational abilities, and robustness.

Findings

01 Experiments on 27 mainstream models show that current models share common challenges in speech interaction tasks.

02 VocalBench reveals gaps in existing speech models' capabilities.

03 The benchmark facilitates targeted improvements for next-generation speech systems.

Abstract

Speech large language models (SpeechLLMs) have extended human-machine interaction from the text modality to the dynamic speech domain. Spoken dialogues convey diverse information, including semantic concepts, acoustic variations, paralinguistic cues, and environmental context. However, existing evaluations of speech interaction models lack instances that mimic real scenarios and predominantly focus on performance in isolated aspects, without a comprehensive comparison of critical capabilities across current routines. To address this gap, we propose VocalBench to assess speech conversational abilities. It comprises around 24k carefully curated instances in both English and Mandarin across four key dimensions (semantic quality, acoustic performance, conversational abilities, and robustness), covering 14 user-oriented characters. Experiments on 27 mainstream models reveal the common…

Code & Models

Repositories

sjtu-omniagent/vocalbench (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis
