LangSAMP: Language-Script Aware Multilingual Pretraining
Yihong Liu, Haotian Ye, Chunlan Ma, Mingyang Wang, Hinrich Schütze

TL;DR
LangSAMP adds language and script embeddings to multilingual pretraining. Applied to continual pretraining of XLM-R on a corpus covering more than 500 languages, it improves zero-shot crosslingual transfer over the baseline and captures language- and script-specific information.
Contribution
It proposes a novel method that incorporates language and script embeddings into Transformer models, enhancing multilingual representation learning and transfer performance.
Findings
Outperforms baseline in zero-shot crosslingual tasks
Captures language and script nuances effectively
Enables better source language selection for transfer
Abstract
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings -- learnable vectors assigned to individual languages. However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse…
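The integration step described in the abstract can be illustrated with a minimal sketch. It assumes the language and script embeddings are combined with the final Transformer output by element-wise addition; the abstract only says they are "integrated", so the addition is an assumption here. The function names (`add_vectors`, `lm_head`, `langsamp_logits`) and the toy values are hypothetical and are not taken from the authors' code.

```python
from typing import List

Vector = List[float]


def add_vectors(*vectors: Vector) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def lm_head(hidden: Vector, weights: List[Vector]) -> Vector:
    """Linear projection to vocabulary logits (one weight row per vocabulary item)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def langsamp_logits(
    token_states: List[Vector],
    lang_emb: Vector,
    script_emb: Vector,
    head_weights: List[Vector],
) -> List[Vector]:
    """Add the language and script embeddings to every final hidden state,
    then pass the result to the language modeling head.

    Assumption: the combination is a plain sum; the source does not specify it.
    """
    return [
        lm_head(add_vectors(state, lang_emb, script_emb), head_weights)
        for state in token_states
    ]


# Toy example: hidden size 2, vocabulary size 3, two tokens.
token_states = [[1.0, 0.0], [0.0, 2.0]]   # output of the Transformer blocks
lang_emb = [0.5, 0.0]                      # hypothetical language embedding
script_emb = [0.0, -1.0]                   # hypothetical script embedding
head_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

logits = langsamp_logits(token_states, lang_emb, script_emb, head_weights)
print(logits)
```

In this sketch the language and script vectors are used only at the prediction step, so the token states produced by the Transformer blocks do not themselves have to carry that information. This matches the motivation given in the abstract.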
Taxonomy
Topics: AI-based Problem Solving and Planning · Model-Driven Software Engineering Techniques · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training · Attentive Walk-Aggregating Graph Neural Network · XLM-R
