CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units
Yeeun Kang

TL;DR
This paper introduces CoVoSwitch, a synthetic code-switching dataset for 13 languages, and evaluates how multilingual translation models perform on code-switched text, revealing strengths and limitations in translating into English and non-English languages.
Contribution
The paper presents CoVoSwitch, a synthetic code-switching translation dataset spanning 13 languages, and analyzes how two multilingual translation models, M2M-100 418M and NLLB-200 600M, perform on code-switched text.
Findings
Models translate code-switched text into English better than into non-English languages.
Low-resource languages gain the most from code-switched units when the translation target is English.
Models struggle with non-English tokens and hallucinate words absent in source sentences.
Abstract
Multilingual code-switching research is often hindered by the lack and linguistically biased status of available datasets. To expand language representation, we synthesize code-switching data by replacing intonation units detected through PSST, a speech segmentation model fine-tuned from OpenAI's Whisper, using a speech-to-text translation dataset, CoVoST 2. With our dataset, CoVoSwitch, spanning 13 languages, we evaluate the code-switching translation performance of two multilingual translation models, M2M-100 418M and NLLB-200 600M. We reveal that the inclusion of code-switching units results in higher translation performance than monolingual settings and that models are better at code-switching translation into English than non-English. Further, low-resource languages gain most from integration of code-switched units when translating into English but much less when translating into…
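The unit-replacement idea in the abstract can be illustrated with a minimal sketch. It assumes the sentence has already been segmented into intonation units and that each source unit has exactly one aligned target-language unit. That one-to-one alignment, the function name `code_switch`, and the hard-coded example units are all hypothetical and chosen for illustration; this page does not describe the paper's actual segmentation output or alignment procedure, and PSST is not called here.

```python
from typing import Sequence


def code_switch(
    src_units: Sequence[str],
    tgt_units: Sequence[str],
    switch_indices: Sequence[int],
) -> str:
    """Swap the source units at switch_indices for their aligned target units.

    Assumption: src_units and tgt_units are aligned one-to-one, so unit i in
    the source corresponds to unit i in the target.
    """
    if len(src_units) != len(tgt_units):
        raise ValueError("source and target units must be aligned one-to-one")
    chosen = set(switch_indices)
    mixed = [
        tgt_units[i] if i in chosen else src_units[i]
        for i in range(len(src_units))
    ]
    return " ".join(mixed)


# Hand-written example units (not taken from CoVoST 2 or produced by PSST).
en_units = ["I went to the market", "because we ran out of rice"]
es_units = ["Fui al mercado", "porque se nos acabó el arroz"]

# Replace only the second unit with its Spanish counterpart.
mixed = code_switch(en_units, es_units, [1])
print(mixed)  # I went to the market porque se nos acabó el arroz
```

Switching at the level of whole units, rather than individual words, keeps each inserted span a contiguous phrase, which is the motivation the abstract gives for using intonation units as the replacement granularity.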
Taxonomy
Topics: Natural Language Processing Techniques
