VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context

Heyang Liu; Ziyang Cheng; Yuhao Wang; Hongcheng Liu; Yiqi Li; Ronghua Wu; Qunshan Gu; Yanfeng Wang; Yu Wang

arXiv:2511.08230·cs.CL·November 18, 2025


Open Access

TL;DR

VocalBench-zh is a Mandarin speech interaction benchmark with 10 subsets and over 10K instances. An evaluation of 14 mainstream models on it reveals challenges common to current approaches and points to directions for future improvement.

Contribution

The paper introduces VocalBench-zh, the first detailed Mandarin speech-to-speech (S2S) benchmark with ability-level divisions, comprising 10 subsets and over 10K instances for systematic evaluation.

Findings

01. Current models share common challenges in speech interaction.

02. VocalBench-zh enables fair comparison of Mandarin speech models.

03. The results highlight the need for new insights into next-generation speech interactive systems.

Abstract

The development of multi-modal large language models (LLMs) has led to intelligent systems capable of speech interaction. Because Mandarin is one of the most widely spoken languages in the world, most models support it to broaden their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to the Mandarin context, consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. An evaluation experiment on 14 mainstream models reveals the common challenges for current routes and highlights the need for new insights into next-generation speech interactive systems. The evaluation code and datasets will be…


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling