Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages

Tuan Nguyen; Huy-Dat Tran

arXiv:2506.14177·cs.CL·June 18, 2025

TL;DR

This paper explores training code-switching automatic speech recognition systems using synthetic data generated by phrase-level mixing, demonstrating improved performance on under-resourced Southeast Asian language pairs without relying on real code-switch data.

Contribution

It introduces a novel phrase-level mixing method to generate synthetic code-switching data and establishes a new benchmark for under-resourced language pairs, showing effective model fine-tuning without real CS data.

Findings

01. Synthetic CS data improves ASR performance on both monolingual and CS tests.

02. The BM-EN language pair benefits most from the proposed method.

03. The method is a cost-effective approach for developing CS-ASR systems.

Abstract

Code-switching (CS), common in multilingual settings, presents challenges for ASR because transcribed CS data is scarce and costly to produce, owing to its linguistic complexity. This study investigates building CS-ASR using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. We use monolingual data augmented with synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), establishing a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy enhances ASR performance on monolingual and CS tests, with BM-EN showing the highest gains, followed by TA-EN and ZH-BM. This finding…
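The abstract does not say how phrases are selected or aligned, so the following is only a minimal sketch of what phrase-level mixing could look like, not the authors' implementation. It assumes each sentence is already split into aligned phrase pairs in the same order in both languages; the names `phrase_mix`, `PAIRS`, and `switch_prob` are hypothetical and chosen for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical aligned phrase pair: (matrix-language phrase, embedded-language phrase).
# Assumption: phrases are pre-aligned and appear in the same order in both languages.
PhrasePair = Tuple[str, str]

# Illustrative Malay-English example (not taken from the paper's data).
PAIRS: List[PhrasePair] = [
    ("saya mahu", "I want"),
    ("pergi ke", "to go to"),
    ("pasar malam", "the night market"),
]


def phrase_mix(pairs: List[PhrasePair], switch_prob: float, rng: random.Random) -> str:
    """Build one synthetic code-switched sentence.

    Each phrase slot is filled with the embedded-language phrase with
    probability `switch_prob`, otherwise with the matrix-language phrase.
    """
    chosen = []
    for matrix_phrase, embedded_phrase in pairs:
        use_embedded = rng.random() < switch_prob
        chosen.append(embedded_phrase if use_embedded else matrix_phrase)
    return " ".join(chosen)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(phrase_mix(PAIRS, switch_prob=0.5, rng=rng))
```

In a full pipeline the matching audio segments would also have to be concatenated to produce speech for each mixed transcript; that step is omitted here because the abstract gives no detail on it.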

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multilingual Education and Policy