A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

Dan Malkin; Tomasz Limisiewicz; Gabriel Stanovsky

arXiv:2205.04086 · cs.CL · May 10, 2022

TL;DR

This paper investigates how the choice of pretraining languages influences cross-lingual transfer in BERT models, proposing a scalable method to identify beneficial language pairs and improve multilingual model performance.

Contribution

It introduces a quadratic-time method to estimate donor-recipient language relations, aiding in selecting optimal pretraining languages for better transfer performance.

Findings

01

Pretraining language choice significantly impacts downstream transfer.

02

The proposed method effectively identifies beneficial language pairs.

03

The method is effective across a diverse set of languages and two downstream tasks.

Abstract

We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations.
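To make the complexity claim concrete, the sketch below compares how many pretraining configurations a pairwise scheme and an exhaustive scheme would need, then aggregates a pairwise table into per-language donor and recipient scores. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `gain` table layout, and the use of a simple mean are all hypothetical, and the numbers in the usage note are placeholders rather than results from the paper.

```python
from statistics import mean


def count_pairwise_configs(n_langs: int) -> int:
    """Unordered language pairs: grows quadratically with n_langs."""
    return n_langs * (n_langs - 1) // 2


def count_exhaustive_configs(n_langs: int) -> int:
    """All non-empty subsets of languages: grows exponentially with n_langs."""
    return 2 ** n_langs - 1


def donor_recipient_scores(gain):
    """Aggregate a pairwise table into donor and recipient scores.

    Assumed layout (hypothetical): gain[a][b] is the change in zero-shot
    performance on language b when language a is part of pretraining,
    measured under balanced data sizes.
    """
    langs = sorted(gain)
    # Donor score: how much a language helps the others, on average.
    donor = {a: mean(gain[a][b] for b in langs if b != a) for a in langs}
    # Recipient score: how much a language is helped by the others, on average.
    recipient = {b: mean(gain[a][b] for a in langs if a != b) for b in langs}
    return donor, recipient
```

With placeholder languages "A", "B", "C" and a made-up table `{"A": {"B": 2.0, "C": 4.0}, "B": {"A": 1.0, "C": 3.0}, "C": {"A": 0.0, "B": -1.0}}`, "A" has the highest donor score and "C" the highest recipient score. For 10 languages the pairwise scheme needs 45 configurations, while the exhaustive one needs 1023.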

Code & Models

Repositories

slab-nlp/linguistic-blood-bank
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning