CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

Jiaming Zhou; Yujie Guo; Shiwan Zhao; Haoqin Sun; Hui Wang; Jiabei He; Aobo Kong; Shiyao Wang; Xi Yang; Yequan Wang; Yonghua Lin; Yong Qin

arXiv:2502.18913·cs.CL·March 13, 2025

TL;DR

This paper introduces CS-Dialogue, a large-scale, 104-hour Mandarin-English code-switching speech dataset with full-length dialogues and transcriptions, aiming to advance ASR systems for naturalistic conversational code-switching scenarios.

Contribution

The paper presents a novel, extensive dataset of spontaneous Mandarin-English code-switching dialogues with complete transcriptions, filling a gap in resources for robust ASR development.

Findings

01 State-of-the-art models struggle with code-switching recognition.

02 Pre-trained models like Whisper have room for improvement.

03 Benchmark results highlight the dataset's complexity for ASR.

Abstract

Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and…

Datasets

BAAI/CS-Dialogue (dataset, 299 downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing

Methods: Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer