FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models
Konstantin Dobler, Gerard de Melo

TL;DR
FOCUS is a novel embedding initialization method that effectively transfers knowledge from a multilingual model to a new language with a specialized tokenizer, improving performance on various NLP tasks.
Contribution
The paper introduces FOCUS, a technique that initializes the embeddings for a new tokenizer from information in the source model's embedding matrix, making it easier to adapt a multilingual model to a single target language.
Findings
FOCUS outperforms random initialization in language modeling.
FOCUS improves downstream task performance (NLI, QA, NER).
Experiments use the multilingual XLM-R as the source model.
Abstract
Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model's embedding matrix. In this paper, we propose FOCUS - Fast Overlapping Token Combinations Using Sparsemax, a novel embedding initialization method that initializes the embedding matrix effectively for a new tokenizer based on information in the source model's embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and…
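The combination step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy vocabularies, the 2-dimensional source embeddings, the 3-dimensional auxiliary vectors, the use of cosine similarity, and the function names `sparsemax` and `init_target_embeddings` are all assumptions made for illustration. The sketch also assumes that tokens in the overlap simply keep their source embedding.

```python
import numpy as np

def sparsemax(z):
    """Project a score vector onto the probability simplex (sort-and-threshold)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum         # positions that stay non-zero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

def init_target_embeddings(source_vocab, source_emb, target_vocab, aux_emb):
    """Hypothetical helper: build a target embedding matrix from the source one."""
    src_index = {tok: i for i, tok in enumerate(source_vocab)}
    overlap = [t for t in target_vocab if t in src_index]

    # Auxiliary (static) vectors of overlapping tokens, normalized for cosine similarity.
    overlap_aux = np.stack([aux_emb[t] for t in overlap])
    overlap_aux = overlap_aux / np.linalg.norm(overlap_aux, axis=1, keepdims=True)
    # Source-model embeddings of the same overlapping tokens.
    overlap_src = np.stack([source_emb[src_index[t]] for t in overlap])

    target_emb = np.empty((len(target_vocab), source_emb.shape[1]))
    for row, tok in enumerate(target_vocab):
        if tok in src_index:
            # Assumption: overlapping tokens copy their source embedding.
            target_emb[row] = source_emb[src_index[tok]]
        else:
            # New token: sparse weights over overlapping tokens from auxiliary similarity.
            query = aux_emb[tok] / np.linalg.norm(aux_emb[tok])
            weights = sparsemax(overlap_aux @ query)
            target_emb[row] = weights @ overlap_src
    return target_emb

# Toy data (all values are made up for illustration).
source_vocab = ["the", "house", "cat"]
source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target_vocab = ["house", "cat", "haus"]         # "haus" is new to the source model
aux_emb = {
    "house": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "haus": np.array([0.9, 0.1, 0.0]),
}

target_emb = init_target_embeddings(source_vocab, source_emb, target_vocab, aux_emb)
print(target_emb)
```

Because sparsemax can assign exactly zero weight, each new token is built from only a few overlapping tokens rather than from the whole overlap, which a softmax would not do.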
Code & Models
- 🤗 konstantindobler/xlm-roberta-base-focus-german · model · 6 dl · ♡ 1
- 🤗 konstantindobler/xlm-roberta-base-focus-arabic · model · 1 dl · ♡ 1
- 🤗 konstantindobler/xlm-roberta-base-focus-kiswahili · model · 2 dl
- 🤗 konstantindobler/xlm-roberta-base-focus-isixhosa · model · 3 dl · ♡ 1
- 🤗 konstantindobler/xlm-roberta-base-focus-hausa · model · 2 dl · ♡ 1
- 🤗 konstantindobler/xlm-roberta-base-focus-extend-german · model · 2 dl
- 🤗 konstantindobler/xlm-roberta-base-focus-extend-arabic · model · 2 dl
- 🤗 konstantindobler/xlm-roberta-base-focus-extend-kiswahili · model · 2 dl
- 🤗 konstantindobler/xlm-roberta-base-focus-extend-hausa · model · 2 dl
- 🤗 konstantindobler/xlm-roberta-base-focus-extend-isixhosa · model · 3 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Sparsemax · XLM-R
