ALIGN-MLM: Word Embedding Alignment is Crucial for Multilingual Pre-training
Henry Tang, Ameet Deshpande, Karthik Narasimhan

TL;DR
ALIGN-MLM introduces a pre-training objective that explicitly aligns word embeddings across languages, significantly improving zero-shot transfer performance, especially between languages with different scripts and structures.
Contribution
The paper proposes ALIGN-MLM, a new pre-training method that emphasizes word embedding alignment, demonstrating its effectiveness over existing objectives in multilingual transfer tasks.
Findings
ALIGN-MLM outperforms XLM and MLM by 35 and 30 F1 points, respectively, on POS tagging.
Embedding alignment correlates strongly with transfer performance (ρ = 0.727).
Explicitly aligning word embeddings enhances multilingual model transferability.
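To make the correlation finding concrete, here is a minimal sketch of how a rank correlation between alignment and transfer scores could be computed. It assumes the reported ρ is a Spearman rank correlation, which this summary does not state. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up toy values for illustration only, NOT results from the paper.
# One entry per hypothetical pre-trained model.
alignment = [0.10, 0.35, 0.50, 0.80, 0.90]  # hypothetical alignment scores
transfer = [20.0, 45.0, 40.0, 70.0, 85.0]   # hypothetical transfer scores
rho = spearman_rho(alignment, transfer)
print(round(rho, 3))
```

A rank correlation only asks whether better-aligned models also tend to transfer better, without assuming the relationship is linear.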
Abstract
Multilingual pre-trained models exhibit zero-shot cross-lingual transfer, where a model fine-tuned on a source language achieves surprisingly good performance on a target language. While studies have attempted to understand transfer, they focus only on MLM, and the large number of differences between natural languages makes it hard to disentangle the importance of different properties. In this work, we specifically highlight the importance of word embedding alignment by proposing a pre-training objective (ALIGN-MLM) whose auxiliary loss guides similar words in different languages to have similar word embeddings. ALIGN-MLM either outperforms or matches three widely adopted objectives (MLM, XLM, DICT-MLM) when we evaluate transfer between pairs of natural languages and their counterparts created by systematically modifying specific properties like the script. In particular, ALIGN-MLM…
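The auxiliary loss described in the abstract can be illustrated with a minimal sketch. The source does not give the loss formula, so this assumes a cosine-similarity penalty over translation pairs from a bilingual dictionary, added to the MLM loss with a weight `alpha`. The function names, the weight, and the toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(embeddings, dictionary_pairs):
    """Mean of (1 - cosine similarity) over translation pairs.

    The value is 0 when every pair points in the same direction.
    Assumed form: the source does not specify the exact loss.
    """
    losses = [
        1.0 - cosine_similarity(embeddings[src], embeddings[tgt])
        for src, tgt in dictionary_pairs
    ]
    return sum(losses) / len(losses)


def total_loss(mlm_loss, embeddings, dictionary_pairs, alpha=1.0):
    """MLM loss plus the weighted alignment term (alpha is an assumed knob)."""
    return mlm_loss + alpha * alignment_loss(embeddings, dictionary_pairs)


# Toy 2-dimensional embeddings, invented for illustration.
embeddings = {
    "dog": [1.0, 0.0],
    "perro": [1.0, 0.0],  # already aligned with "dog"
    "cat": [0.0, 1.0],
    "gato": [1.0, 0.0],   # orthogonal to "cat", so maximally penalized here
}
pairs = [("dog", "perro"), ("cat", "gato")]

aux = alignment_loss(embeddings, pairs)
combined = total_loss(2.0, embeddings, pairs, alpha=0.5)
print(aux, combined)
```

In a real training loop the embeddings would be the model's learnable input embedding table, and minimizing the combined loss would pull dictionary translations toward each other while the MLM term is still being optimized.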
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Dropout · Attention Dropout · Dense Connections · Layer Normalization · Residual Connection
