How to Adapt Your Pretrained Multilingual Model to 1600 Languages
Abteen Ebrahimi, Katharina Kann

TL;DR
This paper evaluates existing methods for adapting pretrained multilingual models to new languages using the New Testament, a small, narrow-domain corpus available for over 1600 languages, and finds that continued pretraining, the simplest approach, performs best.
Contribution
It provides the first large-scale evaluation of adaptation methods for many low-resource languages using a real-world, narrow-domain corpus, highlighting continued pretraining's effectiveness.
Findings
Performance drops for all adaptation methods when only this small, narrow-domain corpus is available, but the adapted models still improve over XLM-R on average.
Continued pretraining yields the best results.
Domain and source language influence adaptation effectiveness.
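Continued pretraining here means further training the model on target-language text with its original masked-language-modeling objective. The following is a minimal sketch of only the token-masking step, written in plain Python so that it runs without any model download. The 15% selection rate and the 80/10/10 replacement split follow the commonly used BERT-style scheme and are assumptions, not settings confirmed by this summary; the function name `mask_tokens`, the `<mask>` string, and the toy vocabulary are likewise illustrative.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask string, chosen for illustration


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one masked-language-modeling example from a token list.

    Returns (inputs, labels). labels[i] is the original token at positions
    selected for prediction and None elsewhere, so a training loss would be
    computed only on the selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)          # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(tok)                 # 10%: keep the original token
        else:
            inputs.append(tok)                     # not selected: left unchanged
            labels.append(None)
    return inputs, labels


# Toy usage on a single whitespace-tokenized sentence (hypothetical data).
verse = "in the beginning was the word".split()
vocab = sorted(set(verse))
inputs, labels = mask_tokens(verse, vocab, mask_prob=0.5, seed=1)
```

In a real setup the tokens would be subword IDs from the model's tokenizer, and the masked inputs would be fed to the pretrained model for additional training steps on the New Testament text of the target language.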
Abstract
Pretrained multilingual models (PMMs) enable zero-shot learning via cross-lingual transfer, performing best for languages seen during pretraining. While methods exist to improve performance for unseen languages, they have almost exclusively been evaluated using amounts of raw text only available for a small fraction of the world's languages. In this paper, we evaluate the performance of existing methods to adapt PMMs to new languages using a resource available for over 1600 languages: the New Testament. This is challenging for two reasons: (1) the small corpus size, and (2) the narrow domain. While performance drops for all approaches, we surprisingly still see gains in accuracy for part-of-speech tagging and in F1 for NER on average over all languages as compared to XLM-R. Another unexpected finding is that continued pretraining, the simplest approach, performs…
Taxonomy
Methods: XLM-R
