WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

Benjamin Minixhofer; Fabian Paischer; Navid Rekabsaz

arXiv:2112.06598·cs.CL·September 13, 2022

TL;DR

WECHSEL is a novel method for efficiently transferring pretrained monolingual language models to new languages by reinitializing subword embeddings using multilingual static word embeddings, reducing training effort and environmental impact.

Contribution

The paper introduces WECHSEL, a new approach for cross-lingual transfer of language models that leverages multilingual embeddings to initialize subword embeddings in target languages.

Findings

01 WECHSEL outperforms existing cross-lingual transfer methods.

02 It reduces training effort by a factor of up to 64.

03 It improves performance on low-resource languages.

Abstract

Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language and token embeddings are initialized such that they are semantically similar to the English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English…
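
The initialization step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that static embeddings for the source and target subwords have already been mapped into one shared multilingual space, and it initializes each target token as a similarity-weighted average of the source model's embeddings for its nearest source tokens. The function name `init_target_embeddings` and the parameters `k` and `temperature` are hypothetical choices made for this example.

```python
import numpy as np


def init_target_embeddings(src_model_emb, src_static, tgt_static, k=2, temperature=0.1):
    """Initialize target-token embeddings from source-token embeddings.

    src_model_emb: (n_src, d_model) embeddings of the pretrained source model.
    src_static:    (n_src, d_static) static embeddings of the source tokens.
    tgt_static:    (n_tgt, d_static) static embeddings of the target tokens,
                   assumed to live in the same aligned space as src_static.
    k, temperature: illustrative assumptions, not values taken from the paper.
    """
    # Normalize rows so that dot products are cosine similarities.
    src_unit = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_unit = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = tgt_unit @ src_unit.T  # shape (n_tgt, n_src)

    # Indices of the k most similar source tokens for every target token.
    k = min(k, sims.shape[1])
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    top_sims = np.take_along_axis(sims, top, axis=1)

    # Softmax over the k similarities gives the mixing weights.
    logits = top_sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted average of the source model embeddings of the neighbors.
    return np.einsum("tk,tkd->td", weights, src_model_emb[top])


# Toy data: 3 source tokens, 2 target tokens, 4-dimensional model embeddings.
toy_src_model_emb = np.arange(12, dtype=float).reshape(3, 4)
toy_src_static = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
toy_tgt_static = np.array([[1.0, 0.01], [0.01, 1.0]])

toy_tgt_emb = init_target_embeddings(toy_src_model_emb, toy_src_static, toy_tgt_static)
```

With `k=1` each target token simply copies the embedding of its single closest source token. Larger `k` blends several neighbors, and a lower `temperature` concentrates the weight on the closest ones.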

Code & Models

Repositories

cpjku/wechsel (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Cosine Annealing · Linear Warmup With Linear Decay · Residual Connection · Dense Connections · Layer Normalization