RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

Enzhi Wang; Qicheng Li; Shiwan Zhao; Aobo Kong; Jiaming Zhou; Xi Yang; Yequan Wang; Yonghua Lin; Yong Qin

arXiv:2508.10015·cs.CL·August 15, 2025

TL;DR

RealTalk-CN is a comprehensive Chinese speech-text dialogue dataset with diverse scenarios, disfluencies, and speaker variations, enabling robust evaluation of speech-based language models and introducing a novel cross-modal chat task.

Contribution

It introduces the first Chinese multi-turn speech-text dialogue dataset with disfluencies and speaker variations, and proposes a new cross-modal chat task for realistic speech-text interactions.

Findings

01 Effective evaluation of speech disfluency robustness

02 Insights into speaker variation sensitivity

03 Validation of cross-modal chat task performance

Abstract

In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive…
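To put the reported corpus size in perspective, the short Python sketch below derives rough per-dialogue and per-utterance averages from the three figures quoted in the abstract (5.4k dialogues, 60K utterances, 150 hours). The function name `corpus_averages` and the constant names are illustrative, not part of the dataset or the paper, and because the abstract's figures are rounded, the results are only approximate.

```python
# Figures as quoted in the abstract; they are rounded there,
# so everything derived from them is approximate.
DIALOGUES = 5_400
UTTERANCES = 60_000
HOURS = 150


def corpus_averages(dialogues, utterances, hours):
    """Return rough averages implied by the corpus-level totals.

    Hypothetical helper for illustration only; it is not an API
    of the RealTalk-CN release.
    """
    total_seconds = hours * 3600
    return {
        "utterances_per_dialogue": utterances / dialogues,
        "seconds_per_utterance": total_seconds / utterances,
        "minutes_per_dialogue": hours * 60 / dialogues,
    }


if __name__ == "__main__":
    for name, value in corpus_averages(DIALOGUES, UTTERANCES, HOURS).items():
        print(f"{name}: {value:.1f}")
```

Under these rounded totals, a dialogue averages about 11 utterances and an utterance about 9 seconds of audio, assuming the 150 hours cover only the counted utterances.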

Code & Models

Datasets

BAAI/RealTalk-CN
dataset· 34 dl