How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

Atsuki Yamaguchi; Aline Villavicencio; Nikolaos Aletras

arXiv:2406.11477 · cs.CL · December 1, 2025

TL;DR

This paper explores effective vocabulary expansion strategies for large language models in low-resource languages using only 0.01GB of target language data, aiming to improve inference speed while maintaining performance.

Contribution

It introduces novel methods for vocabulary expansion in low-resource settings, including embedding initialization and continual pre-training, validated across diverse languages and tasks.

Findings

01

Vocabulary expansion can significantly speed up inference in low-resource languages.

02

Effective strategies maintain competitive performance with minimal target language data.

03

Embedding initialization and continual pre-training are key to successful expansion.
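To make the first finding concrete: an autoregressive LLM emits one token per decoding step, so a vocabulary that covers target language text with fewer tokens needs fewer steps. The sketch below is a toy illustration only. The greedy longest-match segmenter, the function name `greedy_tokenize`, and both vocabularies are assumptions made for this example; they are not the tokenizers or data used in the paper.

```python
def greedy_tokenize(text, vocab):
    """Segment text by greedy longest match; unknown spans fall back to single characters."""
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical source vocabulary that only knows single characters.
base_vocab = {"b", "a", "n"}
# Hypothetical expanded vocabulary with added multi-character tokens.
expanded_vocab = base_vocab | {"ban", "ana"}

base_tokens = greedy_tokenize("banana", base_vocab)
expanded_tokens = greedy_tokenize("banana", expanded_vocab)

# Fewer tokens means fewer decoding steps for the same text.
print(len(base_tokens), len(expanded_tokens))
```

In this toy case the same string takes six tokens with the base vocabulary and two with the expanded one; real speedups depend on the language, the tokenizer, and how many tokens are added.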

Abstract

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding…
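As a rough illustration of what embedding initialization for new tokens can look like, the sketch below uses mean initialization: each new token's embedding is set to the average of the embeddings of the existing tokens it was previously split into. This is a commonly used heuristic, shown here as an assumption; the summary above does not say which initialization strategies the paper evaluates. The function name `expand_embeddings` and the toy matrix are hypothetical.

```python
import numpy as np


def expand_embeddings(emb, new_token_pieces):
    """Append one row per new token to an embedding matrix.

    emb: array of shape (V, d), the existing embedding matrix.
    new_token_pieces: for each new token, the list of existing token ids
        that the original tokenizer would have produced for it.
    Returns an array of shape (V + number of new tokens, d).
    """
    # Mean initialization (assumed heuristic): average the constituent embeddings.
    new_rows = [emb[ids].mean(axis=0) for ids in new_token_pieces]
    return np.vstack([emb, np.stack(new_rows)])


rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))  # toy matrix: 5 existing tokens, dimension 4

# Two hypothetical new tokens, formerly split into ids [0, 1] and [2, 3, 4].
expanded = expand_embeddings(emb, [[0, 1], [2, 3, 4]])
print(expanded.shape)
```

The original rows are left untouched, so the model's behavior on existing tokens is unchanged before any continual pre-training; only the appended rows are new.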

