Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation
Kartik Kartik, Sanjana Soni, Anoop Kunchukuttan, Tanmoy Chakraborty, Md Shad Akhtar

TL;DR
This paper introduces a synthetic Hinglish-English dataset and a joint-training model, RCMT, to improve code-mixed translation robustness, demonstrating superior performance and adaptability to low-resource and noisy scenarios.
Contribution
The paper presents HINMIX, a large synthetic Hinglish-English corpus, and RCMT, a novel robust joint-training model for code-mixed translation handling noise and low-resource challenges.
Findings
RCMT outperforms state-of-the-art methods in code-mixed translation.
HINMIX provides a large synthetic parallel corpus for Hinglish-English.
RCMT demonstrates zero-shot adaptability to Bengalish-English translation.
Abstract
The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in a low-resource setup is to leverage existing data in a resource-rich language through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~4.2M sentence pairs. Subsequently, we propose RCMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the…
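The abstract does not say how the noisy inputs are produced, so the following is only a minimal sketch of one way perturbation-based training data could be prepared. It assumes a simple character-level noise model (vowel dropping, adjacent-character swaps, character repetition) and pairs each clean source sentence and its noisy variant with the same English target, so that a single model sees both. The function names, the noise operations, the `rate` parameter, and the example sentence are all hypothetical and are not taken from the paper.

```python
import random

VOWELS = set("aeiou")

def perturb_word(word, rng):
    """Return a noisy spelling of `word` (hypothetical noise model, not RCMT's)."""
    if len(word) < 4:
        return word  # leave short words untouched
    op = rng.choice(["drop_vowel", "swap", "repeat"])
    i = rng.randrange(1, len(word) - 1)  # an interior character position
    if op == "drop_vowel":
        vowel_positions = [k for k in range(1, len(word)) if word[k] in VOWELS]
        if not vowel_positions:
            return word
        k = rng.choice(vowel_positions)
        return word[:k] + word[k + 1:]
    if op == "swap":
        # swap the characters at positions i and i + 1
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # "repeat": duplicate the character at position i
    return word[:i] + word[i] * 2 + word[i + 1:]

def perturb_sentence(sentence, rate, rng):
    """Perturb each whitespace-separated word with probability `rate`."""
    words = sentence.split()
    noisy = [perturb_word(w, rng) if rng.random() < rate else w for w in words]
    return " ".join(noisy)

def make_joint_pairs(pairs, rate=0.3, seed=0):
    """For every (source, target) pair, emit the clean pair and a noisy-source pair."""
    rng = random.Random(seed)
    out = []
    for src, tgt in pairs:
        out.append((src, tgt))                               # clean source
        out.append((perturb_sentence(src, rate, rng), tgt))  # noisy source, same target
    return out

if __name__ == "__main__":
    # Illustrative Hinglish-English pair, not drawn from HINMIX.
    data = [("mujhe yeh movie bahut pasand hai", "I like this movie a lot")]
    for src, tgt in make_joint_pairs(data, rate=0.5, seed=42):
        print(src, "->", tgt)
```

Because both variants share a target, training one model on the combined list is one simple way to expose the same parameters to clean and noisy spellings; how RCMT itself shares parameters is not described in the text above.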
Taxonomy
Topics: Natural Language Processing Techniques
