UniCoM: A Universal Code-Switching Speech Generator

Sangmin Lee, Woojin Chung, Seyun Um, and Hong-Goo Kang

arXiv:2508.15244 · cs.CL · August 22, 2025

TL;DR

This paper introduces UniCoM, a novel pipeline for generating high-quality, natural code-switching speech samples, addressing data scarcity issues and aiding the development of multilingual speech recognition systems.

Contribution

UniCoM provides a new method for creating realistic code-switching speech datasets without altering sentence semantics, facilitating advancements in multilingual speech technology.

Findings

1. CS-FLEURS achieves high intelligibility and naturalness.
2. UniCoM-generated data performs comparably to existing datasets.
3. The approach enhances multilingual speech recognition and translation.

Abstract

Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental…
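The substitution idea named in the abstract can be illustrated at the text level with a minimal sketch. This is not the authors' SWORDS implementation, which generates speech and whose details are not given here. The lexicon, the set of switchable parts of speech, the `ratio` parameter, and the function name `code_switch` are all hypothetical, and the part-of-speech tags are assumed to be supplied by some external tagger.

```python
import random

# Hypothetical toy lexicon mapping (word, POS) to a translation.
# Keying on POS lets "book" translate differently as a noun and as a verb.
TOY_LEXICON = {
    ("book", "NOUN"): "libro",
    ("book", "VERB"): "reservar",
    ("table", "NOUN"): "mesa",
    ("read", "VERB"): "leer",
}

# Assumption for this sketch: only content words may be switched.
SWITCHABLE_POS = {"NOUN", "VERB", "ADJ"}


def code_switch(tagged_tokens, lexicon, ratio=0.5, seed=0):
    """Replace a fraction of eligible words with POS-matched translations.

    tagged_tokens: list of (word, pos) pairs from any POS tagger.
    ratio: fraction of eligible words to replace (illustrative parameter).
    """
    rng = random.Random(seed)
    # A word is eligible if its POS is switchable and the lexicon has an entry.
    candidates = [
        i for i, (word, pos) in enumerate(tagged_tokens)
        if pos in SWITCHABLE_POS and (word.lower(), pos) in lexicon
    ]
    n_switch = round(ratio * len(candidates))
    chosen = set(rng.sample(candidates, n_switch))
    # Word order and sentence length are left unchanged.
    return [
        lexicon[(word.lower(), pos)] if i in chosen else word
        for i, (word, pos) in enumerate(tagged_tokens)
    ]


tagged = [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("table", "NOUN")]
print(" ".join(code_switch(tagged, TOY_LEXICON, ratio=1.0)))
```

With `ratio=1.0` every eligible word is replaced, so the verb "book" becomes "reservar" rather than the noun translation "libro". That is the kind of disambiguation a part-of-speech check makes possible.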

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems