SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

Peng Xie; Xingyuan Liu; Tsz Wai Chan; Yequan Bie; Yangqiu Song; Yang Wang; Hao Chen; Kani Chen

arXiv:2506.00087·cs.CL·June 3, 2025

TL;DR

This paper introduces SwitchLingua, a large-scale, diverse multilingual and multi-ethnic code-switching dataset, along with a new semantic-aware evaluation metric for improved ASR system assessment.

Contribution

It presents the first extensive multilingual, multi-ethnic code-switching dataset and a semantic-aware error rate metric to better evaluate multilingual ASR performance.

Findings

1. 420K textual code-switching samples across 12 languages
2. Over 80 hours of audio from 174 speakers of diverse backgrounds
3. Introduction of the Semantic-Aware Error Rate (SAER) metric

Abstract

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi-agent collaboration framework specifically…

Taxonomy

Topics: Multilingual Education and Policy