Domain Curricula for Code-Switched MT at MixMT 2022

Lekan Raheem; Maab Elrashid

arXiv:2210.17463·cs.CL·November 1, 2022

TL;DR

This paper explores domain curricula and training strategies for code-switched machine translation, demonstrating that continuous, strategically scheduled training improves performance across multiple domains compared to simple fine-tuning.

Contribution

The paper introduces a domain curriculum approach for code-switched MT that combines continuous training with a sentence alignment objective, and reports that it outperforms traditional fine-tuning.

Findings

01. Domain switching improves early domain performance

02. Continuous training with diverse data enhances overall results

03. Strategic data scheduling outperforms fine-tuning
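
The difference between switching domains in sequence and running one continuous schedule over mixed domains can be illustrated with a small sketch. This is not the authors' implementation: the function names, the domain names, and the sampling rule (pick the next domain with probability proportional to its remaining batches) are all assumptions made for illustration only.

```python
import random
from typing import Dict, Iterator, Tuple

# (domain name, batch index within that domain); both are hypothetical labels
Batch = Tuple[str, int]


def blocked_schedule(batches_per_domain: Dict[str, int]) -> Iterator[Batch]:
    """Visit domains one after another: every batch of the first domain,
    then every batch of the next, and so on (domain switching)."""
    for domain, n_batches in batches_per_domain.items():
        for i in range(n_batches):
            yield (domain, i)


def mixed_schedule(batches_per_domain: Dict[str, int], seed: int = 0) -> Iterator[Batch]:
    """One continuous run that interleaves domains.

    Assumed rule: at each step, choose a domain with probability
    proportional to how many of its batches are still unused.
    """
    rng = random.Random(seed)
    remaining = dict(batches_per_domain)
    next_index = {domain: 0 for domain in remaining}
    while any(remaining.values()):
        domains = [d for d, n in remaining.items() if n > 0]
        weights = [remaining[d] for d in domains]
        domain = rng.choices(domains, weights=weights, k=1)[0]
        yield (domain, next_index[domain])
        next_index[domain] += 1
        remaining[domain] -= 1


if __name__ == "__main__":
    # Hypothetical domain sizes, in batches
    sizes = {"news": 3, "wiki": 2, "social": 1}
    print("blocked:", list(blocked_schedule(sizes)))
    print("mixed:  ", list(mixed_schedule(sizes, seed=1)))
```

Both schedules consume exactly the same batches; only the order differs. In the blocked schedule the model sees the last domain only at the end of training, while the mixed schedule spreads every domain across the whole run.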

Abstract

In multilingual colloquial settings, speakers and writers habitually mix tokens or phrases from different languages within a single expression, a phenomenon known as code-switching or code-mixing (CMX). We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022. The task consists of two subtasks: monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2). Most non-synthetic code-mixed data come from social media, but gathering a significant amount of such data would be laborious, and it shows more writing variation than other domains. For both subtasks, we therefore experimented with data schedules for out-of-domain data. We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective. We…

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling