Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Zeli Su; Ziyin Zhang; Guixian Xu; Jianing Liu; XU Han; Ting Zhang; Yushuang Dong

arXiv:2502.10852·cs.CL·May 30, 2025

TL;DR

This paper introduces a novel framework that reuses weights between encoder and decoder in multilingual models, significantly improving text generation for extremely low-resource languages by leveraging shared semantic spaces.

Contribution

The paper proposes a shared weights pretraining framework for multilingual models, enabling effective low-resource language generation and demonstrating its success on Chinese minority languages.

Findings

1. XLM-SWCM outperforms larger models on downstream tasks.
2. Shared weights improve low-resource language generalization.
3. Framework is effective across multiple low-resource languages.

Abstract

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much…
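The weight-reuse idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which weights are reused or how, so everything below is an assumption made for illustration: the parameter names are hypothetical, parameters are plain Python lists rather than tensors, and matching by name and size is only one plausible reuse rule, not necessarily the one used for XLM-SWCM.

```python
import copy
import random


def init_decoder_from_encoder(encoder_state, decoder_state):
    """Copy each encoder parameter whose name and size match a decoder parameter.

    Assumed reuse rule for illustration only; the paper's scheme may differ.
    Returns (new_decoder_state, reused_names, fresh_names).
    """
    new_state = copy.deepcopy(decoder_state)
    reused, fresh = [], []
    for name, values in decoder_state.items():
        source = encoder_state.get(name)
        if source is not None and len(source) == len(values):
            new_state[name] = list(source)  # reuse the pretrained encoder weights
            reused.append(name)
        else:
            fresh.append(name)  # no encoder counterpart: keep the fresh initialization
    return new_state, reused, fresh


def random_params(names, size, seed):
    """Build a toy state dict of small random values (stand-in for real tensors)."""
    rng = random.Random(seed)
    return {n: [rng.gauss(0.0, 0.02) for _ in range(size)] for n in names}


# Hypothetical parameter names; a decoder usually has cross-attention
# parameters that an encoder lacks.
shared_names = ["layer.0.self_attn.weight", "layer.0.ffn.weight"]
encoder_state = random_params(shared_names, size=4, seed=0)
decoder_state = random_params(shared_names + ["layer.0.cross_attn.weight"], size=4, seed=1)

decoder_state, reused, fresh = init_decoder_from_encoder(encoder_state, decoder_state)
print("reused from encoder:", reused)
print("left at fresh init:", fresh)
```

In this toy setup the decoder starts from the encoder's parameters wherever a counterpart exists, so only the decoder-specific parameters have to be learned from scratch. That is one way to read the abstract's claim that the decoder can leverage the encoder's learned semantic space.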

Code & Models

Repositories

asd765973346/xlm-swcm
pytorch

Models

KEVVVV/xlm-swcm
model · ♡ 3

Videos

Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: XLM-R · LLaMA