Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages
C.M. Downey, Terra Blevins, Nora Goldfine, Shane Steinert-Threlkeld

TL;DR
This paper compares methods for replacing a multilingual model's cross-lingual vocabulary with a compact, language-specific one, improving low-resource language performance at lower computational cost, and highlights the effectiveness of simple embedding re-initialization techniques.
Contribution
It introduces and systematically evaluates simple techniques for replacing cross-lingual vocabularies in multilingual models, showing their competitiveness with more complex methods.
Findings
Embedding-replacement techniques from the monolingual transfer literature are inadequate for adapting multilingual models.
Replacing vocabularies with smaller, language-specific ones improves low-resource language performance.
Simple embedding re-initialization rivals more complex similarity-based methods.
Abstract
Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix comes at considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Namely, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, in addition to the recently-proposed Focus method. We demonstrate that: 1) Embedding-replacement techniques in the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual…
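To make the idea of re-initializing the token embedding matrix after vocabulary specialization concrete, here is a minimal sketch in Python with NumPy. It is not the paper's method: the abstract does not spell out the specific techniques, so this shows one generic strategy under stated assumptions. Tokens shared by the old and new vocabularies keep their old vectors, and the remaining rows are sampled from a normal distribution fitted to the old matrix. The function name `reinit_embeddings` and the toy vocabularies are hypothetical.

```python
import numpy as np


def reinit_embeddings(old_emb, old_vocab, new_vocab, seed=0):
    """Build an embedding matrix for new_vocab from an existing one.

    Assumption (illustrative only): overlapping tokens are copied, and all
    other rows are drawn from a per-dimension normal distribution fitted
    to the old embedding matrix.

    old_emb:   array of shape (len(old_vocab), d)
    old_vocab: dict mapping token -> row index in old_emb
    new_vocab: dict mapping token -> row index in the new matrix
    """
    rng = np.random.default_rng(seed)
    d = old_emb.shape[1]

    # Per-dimension statistics of the original embeddings.
    mean = old_emb.mean(axis=0)
    std = old_emb.std(axis=0)

    # Start every row from a random draw that matches those statistics.
    new_emb = rng.normal(mean, std, size=(len(new_vocab), d)).astype(old_emb.dtype)

    # Overwrite rows for tokens that also exist in the old vocabulary.
    copied = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] = old_emb[old_id]
            copied += 1
    return new_emb, copied


# Toy example: a 4-token "cross-lingual" vocabulary shrunk to 3 tokens.
old_vocab = {"<s>": 0, "the": 1, "cat": 2, "chat": 3}
old_emb = np.arange(16, dtype=np.float32).reshape(4, 4)
new_vocab = {"<s>": 0, "chat": 1, "chaton": 2}

new_emb, copied = reinit_embeddings(old_emb, old_vocab, new_vocab)
print(new_emb.shape, copied)
```

In this toy case the new matrix has three rows instead of four, which is the source of the savings: a smaller vocabulary means a smaller embedding matrix to store and update during adaptation. How the non-overlapping rows are initialized is exactly the design choice the study compares, so the random draw here should be read as a placeholder for that choice.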
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Focus
