TL;DR
WECHSEL is a novel method for efficiently transferring pretrained monolingual language models to new languages by reinitializing subword embeddings using multilingual static word embeddings, reducing training effort and environmental impact.
Contribution
The paper introduces WECHSEL, a new approach for cross-lingual transfer of language models that leverages multilingual embeddings to initialize subword embeddings in target languages.
Findings
WECHSEL outperforms existing cross-lingual transfer methods.
It needs up to 64 times less training effort than training a comparable model from scratch in the target language.
It improves performance on low-resource languages.
Abstract
Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language and token embeddings are initialized such that they are semantically similar to the English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English…
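The embedding initialization described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes every source and target subword already has a vector in a shared (aligned) static embedding space, and it initializes each target embedding as a similarity-weighted average of the source model's embeddings for its nearest source subwords. The function name `init_target_embeddings`, the parameters `k` and `temperature`, and the softmax weighting are assumptions made for this example.

```python
import numpy as np


def init_target_embeddings(source_model_emb, source_static, target_static,
                           k=2, temperature=0.1):
    """Sketch: initialize target-language subword embeddings (hypothetical helper).

    source_model_emb: (n_src, d_model) learned embeddings of the source LM.
    source_static:    (n_src, d_static) static vectors of the source subwords.
    target_static:    (n_tgt, d_static) static vectors of the target subwords,
                      assumed to live in the same aligned space as source_static.
    k, temperature:   illustrative hyperparameters (assumptions).
    """
    def normalize(x):
        # Unit-normalize rows so that dot products are cosine similarities.
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, 1e-12)

    # Cosine similarity between every target and every source subword.
    sim = normalize(target_static) @ normalize(source_static).T  # (n_tgt, n_src)

    # Indices and similarities of the k most similar source subwords per target.
    k = min(k, sim.shape[1])
    nn_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    nn_sim = np.take_along_axis(sim, nn_idx, axis=1)

    # Softmax over the neighbour similarities (shifted for numerical stability).
    weights = np.exp((nn_sim - nn_sim.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the source model's embeddings of those neighbours.
    return np.einsum("tk,tkd->td", weights, source_model_emb[nn_idx])


# Toy usage with made-up vectors: 3 source subwords, 2 target subwords.
rng = np.random.default_rng(0)
source_model_emb = rng.normal(size=(3, 4))
source_static = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target_static = np.array([[1.0, 0.0], [0.0, 1.0]])

target_emb = init_target_embeddings(source_model_emb, source_static, target_static)
print(target_emb.shape)  # (2, 4)
```

In this sketch the rest of the model (everything except the embedding matrix) would be copied unchanged from the source model, and the transferred model would then be trained further on target-language text.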
Code & Models
- 🤗 Finnish-NLP/roberta-large-wechsel-finnish (model) · 4 dl · ♡ 1
- 🤗 benjamin/gpt2-large-wechsel-ukrainian (model) · 189 dl · ♡ 8
- 🤗 benjamin/gpt2-wechsel-ukrainian (model) · 42 dl · ♡ 7
- 🤗 benjamin/gpt2-wechsel-uyghur (model) · 15 dl · ♡ 1
- 🤗 malteos/gpt2-xl-wechsel-german (model) · 17 dl · ♡ 11
- 🤗 RichardErkhov/benjamin_-_gpt2-large-wechsel-ukrainian-4bits (model)
- 🤗 RichardErkhov/benjamin_-_gpt2-large-wechsel-ukrainian-8bits (model) · 1 dl
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Cosine Annealing · Linear Warmup With Linear Decay · Residual Connection · Dense Connections · Layer Normalization
