Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation

Kartik Kartik; Sanjana Soni; Anoop Kunchukuttan; Tanmoy Chakraborty; Md Shad Akhtar

arXiv:2403.16771·cs.CL·May 1, 2024·1 cites

TL;DR

This paper introduces a synthetic Hinglish-English dataset and a joint-training model, RCMT, to improve code-mixed translation robustness, demonstrating superior performance and adaptability to low-resource and noisy scenarios.

Contribution

The paper presents HINMIX, a large synthetic Hinglish-English parallel corpus, and RCMT, a robust joint-training model for code-mixed translation that is designed to handle noisy input and low-resource settings.

Findings

01 RCMT outperforms state-of-the-art methods in code-mixed translation.

02 HINMIX provides a large synthetic parallel corpus for Hinglish-English.

03 RCMT demonstrates zero-shot adaptability to Bengalish-English translation.

Abstract

Widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in a low-resource setup is to leverage existing data in resource-rich languages through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~4.2M sentence pairs. Subsequently, we propose RCMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the…
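
The abstract describes perturbation-based joint training only at a high level. The sketch below is not the authors' RCMT implementation; it merely illustrates the general idea of pairing a clean source sentence and a synthetically perturbed copy with the same English target, so that a single model with shared parameters would see both variants. The function names (`perturb_word`, `make_joint_examples`), the three noise operations, the probability `p`, and the example sentence are all assumptions made for illustration.

```python
import random


def perturb_word(word, rng, p=0.3):
    """Return a noisy variant of `word` with probability p, else the word unchanged.

    Hypothetical noise model: drop, swap, or repeat one interior character.
    """
    if len(word) < 3 or rng.random() >= p:
        return word
    chars = list(word)
    op = rng.choice(["drop", "swap", "repeat"])
    i = rng.randrange(1, len(chars) - 1)  # interior index; first character is kept
    if op == "drop":
        del chars[i]
    elif op == "swap":
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:  # repeat
        chars.insert(i, chars[i])
    return "".join(chars)


def make_joint_examples(source, target, seed=0, p=0.3):
    """Build one clean and one perturbed training pair that share the same target."""
    rng = random.Random(seed)
    noisy = " ".join(perturb_word(w, rng, p) for w in source.split())
    return [(source, target), (noisy, target)]


if __name__ == "__main__":
    # Illustrative Hinglish sentence, not taken from HINMIX.
    for src, tgt in make_joint_examples(
        "mujhe yeh movie bahut pasand hai", "I like this movie a lot", seed=1, p=1.0
    ):
        print(f"{src!r} -> {tgt!r}")
```

Training on both pairs in the same batch is one simple way to expose shared parameters to clean and noisy inputs; how RCMT actually shares parameters is not specified in the text above.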

Code & Models

Datasets

kartikagg98/HINMIX_hi-en
dataset · 497 downloads

Taxonomy

Topics: Natural Language Processing Techniques