VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
Heyang Liu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yiqi Li, Yixuan Hou, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

TL;DR
VocalBench provides a comprehensive benchmark for evaluating speech conversational abilities of models across multiple dimensions, addressing gaps in real-world scenario testing and enabling better comparison of current speech interaction systems.
Contribution
This paper introduces VocalBench, a new benchmark of around 24,000 instances in English and Mandarin for assessing speech models along four dimensions: semantic quality, acoustic performance, conversational abilities, and robustness.
Findings
Current models face common challenges in speech interaction tasks.
VocalBench reveals gaps in existing speech models' capabilities.
The benchmark facilitates targeted improvements for next-generation speech systems.
Abstract
Speech large language models (SpeechLLMs) have extended human-machine interactions from the text modality to the dynamic speech domain. Spoken dialogues convey diverse information, including semantic concepts, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models lack instances mimicking real scenarios and predominantly focus on the performance of distinct aspects, lacking a comprehensive comparison of critical capabilities between current routines. To address this gap, we propose VocalBench to assess the speech conversational abilities, comprising around 24k carefully curated instances of both English and Mandarin across four key dimensions - semantic quality, acoustic performance, conversational abilities, and robustness, covering 14 user-oriented characters. Experiments on 27 mainstream models reveal the common…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis
