WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Yangzhuo Li; Shengpeng Ji; Yifu Chen; Tianle Liang; Haorong Ying; Yule Wang; Junbo Li; Jun Fang; Zhou Zhao

arXiv:2602.12135·cs.CL·February 16, 2026

TL;DR

WavBench is a benchmark for end-to-end spoken dialogue models that evaluates reasoning, colloquial language, and paralinguistics across several challenging subsets, targeting the realistic conversational abilities that text-oriented evaluations overlook.

Contribution

The paper introduces a tripartite framework and dataset for evaluating reasoning, colloquialism, and paralinguistics in spoken dialogue models, addressing gaps left by evaluations that follow text-generation standards.

Findings

01 State-of-the-art models show varied performance across subsets.

02 The benchmark reveals strengths and weaknesses in reasoning and paralinguistic understanding.

03 The benchmark is intended to guide future development of more robust spoken dialogue systems.

Abstract

With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic…

Code & Models

Datasets

WavBench/WavBench
dataset · 15k downloads

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications