Effective vocabulary expanding of multilingual language models for extremely low-resource languages
Jianyu Zheng

TL;DR
This paper proposes a vocabulary expansion method for multilingual language models to support previously unsupported low-resource languages, improving task performance without degrading source language capabilities.
Contribution
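It introduces a vocabulary expansion technique that initializes new token representations from bilingual dictionaries and then continues pre-training on a target-language corpus, extending multilingual pre-trained language models (mPLMs) to previously unsupported low-resource languages.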
Findings
Outperforms the baseline on POS tagging and NER
Improves POS tagging and NER by 0.54% and 2.60%, respectively
Maintains source language performance after adaptation
Abstract
Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset of the model's original vocabulary that is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs on the target language corpus, starting from the representations of the expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses…
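The dictionary-based initialization step described in the abstract can be illustrated with a minimal sketch. It assumes that each new target-language token is initialized to the mean of the embeddings of its source-language translations, with a fallback to the mean of all existing embeddings when no translation is in the vocabulary. The summary does not state the paper's exact initialization or screening rule, so the averaging scheme, the function names (`mean_vector`, `init_expanded_embeddings`), and the toy data are all hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def init_expanded_embeddings(embeddings, new_tokens, bilingual_dict):
    """Initialise embeddings for new target-language tokens.

    embeddings:     dict of existing vocabulary token -> vector (list of floats)
    new_tokens:     target-language tokens added to the vocabulary
    bilingual_dict: dict of target token -> list of source-language words

    Assumption (not from the paper): a new token gets the mean embedding of
    its translations that exist in the original vocabulary.
    """
    # Fallback for tokens with no usable translation: mean of all embeddings.
    fallback = mean_vector(list(embeddings.values()))
    expanded = {}
    for token in new_tokens:
        translations = [w for w in bilingual_dict.get(token, []) if w in embeddings]
        if translations:
            expanded[token] = mean_vector([embeddings[w] for w in translations])
        else:
            expanded[token] = list(fallback)
    return expanded


# Toy, made-up data: a 2-dimensional source vocabulary and three new tokens.
embeddings = {"water": [1.0, 0.0], "river": [0.0, 1.0], "dog": [2.0, 2.0]}
bilingual_dict = {"tgt_water": ["water", "river"], "tgt_dog": ["dog", "hound"]}
new_tokens = ["tgt_water", "tgt_dog", "tgt_unknown"]

expanded = init_expanded_embeddings(embeddings, new_tokens, bilingual_dict)
```

In this sketch, translations missing from the original vocabulary (here "hound") are simply skipped. In a real model, the resulting vectors would be copied into the enlarged embedding matrix before continued pre-training on the target-language corpus.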
Taxonomy
Topics: ICT in Developing Communities · Topic Modeling · Natural Language Processing Techniques
