VietMix: A Naturally-Occurring Parallel Corpus and Augmentation Framework for Vietnamese-English Code-Mixed Machine Translation
Hieu Tran, Phuong-Anh Nguyen-Le, Huy Nghiem, Quang-Nhan Nguyen, Wei Ai, Marine Carpuat

TL;DR
This paper introduces VietMix, a Vietnamese-English code-mixed parallel corpus, and a data augmentation framework that improves translation quality in low-resource, informal language settings by up to 3.5 xCOMET points over back-translation baselines.
Contribution
The paper presents VietMix, the first expert-translated, naturally occurring parallel corpus of Vietnamese-English code-mixed text, and an augmentation pipeline based on iterative fine-tuning and targeted filtering that improves translation models in low-resource scenarios.
Findings
Models with augmented data outperform back-translation baselines by up to 3.5 xCOMET points.
Zero-shot models improve by up to 11.9 points with the proposed augmentation.
VietMix provides a valuable resource for Vietnamese-English code-mixed machine translation.
Abstract
Machine translation (MT) systems universally degrade when faced with code-mixed text. This problem is more acute for low-resource languages that lack dedicated parallel corpora. This work directly addresses this gap for Vietnamese-English, a language context characterized by challenges including orthographic ambiguity and the frequent omission of diacritics in informal text. We introduce VietMix, the first expert-translated, naturally occurring parallel corpus of Vietnamese-English code-mixed text. We establish VietMix's utility by developing a data augmentation pipeline that leverages iterative fine-tuning and targeted filtering. Experiments show that models augmented with our data outperform strong back-translation baselines by up to +3.5 xCOMET points and improve zero-shot models by up to +11.9 points. Our work delivers a foundational resource for a challenging language pair and…
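The orthographic ambiguity mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function name `strip_diacritics` and the sample words are assumptions chosen for illustration. It shows how dropping diacritics, as often happens in informal typing, collapses several distinct Vietnamese words into one ASCII string.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks (hypothetical helper, not from the paper)."""
    # đ/Đ are separate letters with no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), then recompose what remains.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


# Six distinct words that differ only in tone marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_diacritics(w) for w in words}
print(collapsed)  # {'ma'}
```

Once the marks are gone, a translation system has to recover the intended word from context alone, which is one reason informal Vietnamese text is harder to translate than edited text.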
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
