TL;DR
This paper investigates vocabulary expansion strategies for large language models in low-resource settings, using only 0.01 GB of target-language data, with the aim of speeding up inference while maintaining task performance.
Contribution
It introduces novel methods for vocabulary expansion in low-resource settings, including embedding initialization and continual pre-training, validated across diverse languages and tasks.
Findings
Vocabulary expansion can significantly speed up inference in low-resource languages.
Effective strategies maintain competitive performance with minimal target language data.
Embedding initialization and continual pre-training are key to successful expansion.
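To make the embedding initialization step concrete, here is a minimal sketch of one commonly used heuristic: setting a new token's embedding to the average of the source-model embeddings of the pieces it was previously split into. This summary does not say which initialization strategies the paper uses or recommends, so treat the choice of mean initialization as an assumption. The function name `mean_init`, the toy pieces, and the 2-dimensional vectors are all hypothetical.

```python
def mean_init(pieces, source_embeddings):
    """Average the source embeddings of the pieces a new token used to split into.

    Assumption: every piece already has an embedding in the source model.
    """
    vectors = [source_embeddings[p] for p in pieces]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical source embeddings (2 dimensions for readability).
source_embeddings = {"na": [1.0, 0.0], "mas": [0.0, 2.0]}

# Suppose the new token "namas" was previously tokenized as ["na", "mas"].
new_embedding = mean_init(["na", "mas"], source_embeddings)
print(new_embedding)
```

In practice the same averaging would be applied row by row to the model's input embedding matrix (and typically its output head) after resizing it for the expanded vocabulary, followed by continual pre-training on the target-language data.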
Abstract
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding…
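The link between vocabulary and inference steps can be illustrated with a toy example. The sketch below uses a simple greedy longest-match tokenizer, which is an assumption made for illustration and not the tokenizer of any model discussed here, to show that adding target-language tokens shortens the token sequence and therefore reduces the number of generation steps. The text and vocabulary entries are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens = []
    i = 0
    max_len = max((len(t) for t in vocab), default=1)
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


text = "namaskaram namaskaram"  # stand-in for target-language text

# A source vocabulary with no useful target-language entries
# falls back to one token per character.
base_tokens = greedy_tokenize(text, set())

# An expanded vocabulary with two hypothetical target-language tokens.
expanded_tokens = greedy_tokenize(text, {"namas", "karam"})

print(len(base_tokens), len(expanded_tokens))
```

Each generated token costs one forward pass, so a shorter tokenization of the same text translates directly into fewer inference steps. Real subword tokenizers with byte fallback behave differently in detail, but the direction of the effect is the same.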
Code & Models
- 🤗 atsuki-yamaguchi/Llama-2-7b-hf-te-30K-1000-align-2x2ls-mtp-512
- 🤗 atsuki-yamaguchi/Llama-2-7b-hf-my-30K-align-mtp-512
- 🤗 atsuki-yamaguchi/Llama-3-8B-si-30K-100-rand
- 🤗 atsuki-yamaguchi/gemma-2-9b-si-30K-align
- 🤗 atsuki-yamaguchi/Llama-3-8B-si-30K-1000-rand
- 🤗 atsuki-yamaguchi/gemma-2-9b-te-30K-align
- 🤗 atsuki-yamaguchi/Llama-3-8B-te-30K-5000-mean
- 🤗 atsuki-yamaguchi/gemma-2-9b-te-30K-50-align
- 🤗 atsuki-yamaguchi/Llama-3-8B-te-30K-100-rand
- 🤗 atsuki-yamaguchi/gemma-2-9b-si-30K-1000-rand
Taxonomy
Methods: Sparse Evolutionary Training
