A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank
Dan Malkin, Tomasz Limisiewicz, Gabriel Stanovsky

TL;DR
This paper investigates how the choice of pretraining languages influences cross-lingual transfer in BERT models, proposing a scalable method to identify beneficial language pairs and improve multilingual model performance.
Contribution
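It introduces a method for estimating donor-recipient relations between languages whose cost is quadratic in the number of languages, rather than the exponential cost of exhaustively testing every combination of pretraining languages. This helps in selecting pretraining languages that transfer better.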
Findings
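The choice of pretraining languages affects downstream cross-lingual transfer in BERT-based models, even under balanced data conditions.
The proposed method identifies donor languages (which improve others' zero-shot performance) and recipient languages (which are improved).
The method is effective on a diverse set of languages spanning different linguistic features and on two downstream tasks.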
Abstract
We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations.
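The abstract's complexity argument can be illustrated with a minimal sketch. It assumes, hypothetically, that each ordered (donor, recipient) pair yields one number: the change in the recipient's zero-shot score when the donor is included in pretraining. The function names, the `gain` table layout, and the averaging used for the scores are illustrative assumptions, not the authors' implementation, and the numbers in the usage lines are made up.

```python
from statistics import mean


def count_pairwise_runs(n_languages):
    """Ordered (donor, recipient) pairs: grows quadratically with n."""
    return n_languages * (n_languages - 1)


def count_exhaustive_runs(n_languages):
    """Non-empty subsets of pretraining languages: grows exponentially with n."""
    return 2 ** n_languages - 1


def donor_recipient_scores(gain):
    """Aggregate a pairwise gain table into per-language scores.

    gain[d][r] is a hypothetical measurement: the change in zero-shot
    performance on recipient r when donor d is part of pretraining.
    Averaging is an assumed aggregation, chosen only for illustration.
    """
    langs = sorted(gain)
    donor = {d: mean(gain[d][r] for r in langs if r != d) for d in langs}
    recipient = {r: mean(gain[d][r] for d in langs if d != r) for r in langs}
    return donor, recipient


# Made-up gains for three placeholder languages, for illustration only.
toy_gain = {
    "a": {"b": 2.0, "c": 4.0},
    "b": {"a": -1.0, "c": 1.0},
    "c": {"a": 3.0, "b": 0.0},
}
donor, recipient = donor_recipient_scores(toy_gain)
print(donor)      # per-language average gain given to others
print(recipient)  # per-language average gain received from others
print(count_pairwise_runs(10), count_exhaustive_runs(10))
```

With 10 languages, the pairwise scheme needs 90 measurements while the exhaustive one would need 1023 pretraining configurations, which is the gap the abstract refers to.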
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
