Domain Curricula for Code-Switched MT at MixMT 2022
Lekan Raheem, Maab Elrashid

TL;DR
This paper explores domain curricula and training strategies for code-switched machine translation, demonstrating that continuous, strategically scheduled training improves performance across multiple domains compared to simple fine-tuning.
Contribution
It introduces a domain curriculum approach with continuous training and sentence alignment for code-switched MT, outperforming traditional fine-tuning methods.
Findings
Domain switching improves early domain performance
Continuous training with diverse data enhances overall results
Strategic data scheduling outperforms fine-tuning
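As a rough illustration of what a data schedule over domains can look like, here is a minimal sketch. It is not the authors' method: the linear mixing rule, the function names `domain_mix` and `scheduled_batches`, and the 0.1 to 0.9 mixing range are all assumptions made for this example. It gradually shifts the sampling probability from out-of-domain data toward in-domain (code-mixed) data over the course of training, instead of switching abruptly as plain fine-tuning would.

```python
import random

def domain_mix(step, total_steps, start_in_domain=0.1, end_in_domain=0.9):
    """Probability of drawing an in-domain example at a given step.

    Hypothetical linear schedule: starts mostly out-of-domain and ends
    mostly in-domain. The endpoints are illustrative, not from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start_in_domain + frac * (end_in_domain - start_in_domain)

def scheduled_batches(in_domain, out_of_domain, total_steps, batch_size, seed=0):
    """Yield one batch per step, mixing domains according to domain_mix."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    for step in range(total_steps):
        p = domain_mix(step, total_steps)
        yield [
            rng.choice(in_domain) if rng.random() < p else rng.choice(out_of_domain)
            for _ in range(batch_size)
        ]
```

For example, `list(scheduled_batches(cmx_pairs, news_pairs, total_steps=1000, batch_size=32))` would produce 1000 batches whose share of code-mixed pairs grows over time (`cmx_pairs` and `news_pairs` are placeholder lists of sentence pairs).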
Abstract
In multilingual colloquial settings, it is a habitual occurrence to compose expressions of text or speech containing tokens or phrases of different languages, a phenomenon popularly known as code-switching or code-mixing (CMX). We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022: the task consists of two subtasks, monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2). Most non-synthetic code-mixed data are from social media but gathering a significant amount of this kind of data would be laborious and this form of data has more writing variation than other domains, so for both subtasks, we experimented with data schedules for out-of-domain data. We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective. We…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
