Improving Code-Switching Speech Recognition with TTS Data Augmentation

Yue Heng Yeo, Yuchen Hu, Shreyas Gopal, Yizhou Peng, Hexin Liu, and Eng Siong Chng

arXiv:2601.00935·eess.AS·January 6, 2026

TL;DR

This paper demonstrates that using multilingual TTS models to generate synthetic code-switching speech data can significantly improve the accuracy of speech recognition systems in low-resource, conversational Chinese-English scenarios.

Contribution

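The paper fine-tunes the multilingual CosyVoice2 TTS model on the SEAME dataset to synthesize conversational Chinese-English code-switching speech, then uses that speech to augment ASR training data, increasing both the quantity and the speaker diversity of the data and easing data scarcity.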

Findings

01

Augmenting real speech with synthetic speech reduces the mixed error rate (MER) from 12.1% to 10.1% on DevMan.

02

Augmenting real speech with synthetic speech reduces MER from 17.8% to 16.0% on DevSGE.

03

Multilingual TTS effectively improves ASR robustness in low-resource settings.
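
To put the two reported MER drops on a common scale, the short sketch below converts them to absolute and relative reductions. The MER values are the ones reported above; the helper name `relative_reduction` is only illustrative and does not come from the paper.

```python
def relative_reduction(baseline: float, augmented: float) -> float:
    """Relative error reduction in percent, given two error rates in percent."""
    return 100.0 * (baseline - augmented) / baseline


# (baseline MER, MER with synthetic speech added), in percent, as reported above.
reported = {"DevMan": (12.1, 10.1), "DevSGE": (17.8, 16.0)}

for name, (baseline, augmented) in reported.items():
    absolute = baseline - augmented
    relative = relative_reduction(baseline, augmented)
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1f}% relative")
```

Under this arithmetic, the DevMan gain is about 2.0 points absolute (roughly 16.5% relative) and the DevSGE gain is about 1.8 points absolute (roughly 10.1% relative).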

Abstract

Automatic speech recognition (ASR) for conversational code-switching speech remains challenging due to the scarcity of realistic, high-quality labeled speech data. This paper explores multilingual text-to-speech (TTS) models as an effective data augmentation technique to address this shortage. Specifically, we fine-tune the multilingual CosyVoice2 TTS model on the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech, significantly increasing the quantity and speaker diversity of available training data. Our experiments demonstrate that augmenting real speech with synthetic speech reduces the mixed error rate (MER) from 12.1 percent to 10.1 percent on DevMan and from 17.8 percent to 16.0 percent on DevSGE, indicating consistent performance gains. These results confirm that multilingual TTS is an effective and practical tool for enhancing ASR robustness…
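
The abstract reports results as mixed error rate (MER) but does not spell out how it is scored. The sketch below shows one common way to compute it for Chinese-English text, under two assumptions that are not taken from the paper: each Chinese character counts as one token, and each run of Latin letters or digits counts as one token. The function names and the tokenization rule are hypothetical, and real scoring pipelines may normalize text differently.

```python
import re

# Assumption: one token per CJK character, one token per Latin word.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize(text: str) -> list[str]:
    """Split mixed Chinese-English text into characters and lowercased words."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def mixed_error_rate(refs: list[str], hyps: list[str]) -> float:
    """Total token errors divided by total reference tokens, in percent."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference transcripts contain no tokens")
    return 100.0 * errors / total
```

For example, scoring the hypothesis "我今天要去 meeting" against the reference "我今天要去 meeting room" gives seven reference tokens and one deletion, so `mixed_error_rate` returns 100/7, or about 14.3.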

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research