Effective vocabulary expanding of multilingual language models for extremely low-resource languages

Jianyu Zheng

arXiv:2602.09388 · cs.CL · February 11, 2026

Open Access

TL;DR

This paper proposes a vocabulary expansion method for multilingual language models to support previously unsupported low-resource languages, improving task performance without degrading source language capabilities.

Contribution

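The paper introduces a vocabulary expansion technique for multilingual pre-trained language models (mPLMs): new target-language tokens are initialized from bilingual dictionaries, and the model is then further pre-trained on a target-language corpus, extending it to low-resource languages it did not previously support.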

Findings

01

Outperforms baseline in POS tagging and NER tasks

02

Improves on the baseline by 0.54% in POS tagging and 2.60% in NER

03

Maintains source language performance after adaptation

Abstract

Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of this expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses…
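
The dictionary-based initialization step described in the abstract can be illustrated with a small sketch. The abstract does not say how dictionary entries are combined, so the version below assumes the simplest option: each new target-language token takes the mean of the embeddings of its source-language translations, and tokens with no usable dictionary entry fall back to small random values. The function names, the toy vocabulary, and the averaging rule are all hypothetical and are not taken from the paper.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def init_expanded_embeddings(new_tokens, bilingual_dict, source_vocab,
                             source_emb, dim, seed=0):
    """Initialize embeddings for newly added target-language tokens.

    ASSUMPTION (not from the paper): a new token's embedding is the mean
    of the embeddings of its source-language translations that appear in
    the selected source vocabulary subset. Tokens without any usable
    translation receive small Gaussian noise instead.
    """
    rng = random.Random(seed)
    new_emb = {}
    for tok in new_tokens:
        # Keep only translations that exist in the source vocabulary subset.
        translations = [w for w in bilingual_dict.get(tok, [])
                        if w in source_vocab]
        if translations:
            rows = [source_emb[source_vocab[w]] for w in translations]
            new_emb[tok] = mean_vector(rows)
        else:
            new_emb[tok] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return new_emb


# Toy data, purely illustrative: a 3-word source subset with 2-d embeddings.
source_vocab = {"water": 0, "river": 1, "stone": 2}
source_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Hypothetical dictionary: target-language token -> source-language words.
bilingual_dict = {"tok_a": ["water", "river"], "tok_b": ["stone"]}
new_tokens = ["tok_a", "tok_b", "tok_c"]  # tok_c has no dictionary entry

new_emb = init_expanded_embeddings(
    new_tokens, bilingual_dict, source_vocab, source_emb, dim=2
)
```

In this toy run, `tok_a` lands midway between the embeddings of "water" and "river", `tok_b` copies the embedding of "stone", and `tok_c` gets a random vector. In a real setting the resulting vectors would be written into the rows appended to the model's embedding matrix before continued pre-training on the target-language corpus.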

Taxonomy

Topics: ICT in Developing Communities · Topic Modeling · Natural Language Processing Techniques