On the Role of Parallel Data in Cross-lingual Transfer Learning

Machel Reid; Mikel Artetxe

arXiv:2212.10173·cs.CL·December 21, 2022

Open Access

TL;DR

This paper investigates whether improvements in cross-lingual transfer learning stem from the parallel data itself or from modeling parallel interactions. It finds that synthetic parallel data can be beneficial, although real parallel data still yields the best results.

Contribution

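It compares synthetic parallel data generated by unsupervised machine translation against data from supervised machine translation and gold parallel data in cross-lingual transfer, highlighting the potential of synthetic data and questioning the traditional categorization of cross-lingual learning approaches.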

Findings

01. Synthetic parallel data improves downstream tasks.
02. Real parallel data still provides the best performance.
03. Multilingual models underutilize monolingual data.

Abstract

While prior work has established that the use of parallel data is conducive to cross-lingual learning, it is unclear whether the improvements come from the data itself or from the modeling of parallel interactions. To explore this, we examine the use of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model-generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) and a task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification