C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma; Wei Tao; Yiwen Guo

arXiv:2507.22968 · cs.CL · October 7, 2025

Open Access · 1 dataset · 1 video

TL;DR

This paper introduces C3, a bilingual benchmark dataset and evaluation method for spoken dialogue models. It targets challenges such as ambiguity and context-dependency in complex human conversations, with the aim of improving the models' practical effectiveness.

Contribution

The paper provides a new bilingual benchmark dataset and an LLM-based evaluation approach designed specifically for assessing spoken dialogue models in complex conversational scenarios.

Findings

01. Benchmark dataset with 1,079 instances in English and Chinese

02. Evaluation method aligned with human judgment

03. Insights into SDMs' ability to handle ambiguity and context

Abstract

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, such as omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To…

Code & Models

Datasets

ChengqianMa/C3
dataset · 256 dl

Videos

C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling