Large Vocabulary Size Improves Large Language Models

Sho Takase; Ryokan Ri; Shun Kiyono; Takuya Kato

arXiv:2406.16508 · cs.CL · May 29, 2025 · 1 citation

TL;DR

This paper shows that increasing the subword vocabulary size improves large language model (LLM) performance. It also introduces a simple method for switching to a new vocabulary during continual training on a different target language, and the model with the new vocabulary outperforms the one that keeps the pre-training vocabulary.

Contribution

The paper provides empirical evidence that larger subword vocabularies benefit LLMs and proposes a simple approach for replacing the vocabulary during continual training on a different target language.

Findings

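01 Larger vocabularies improve LLM performance.

02 In continual training on a different target language, replacing the pre-training vocabulary with a new one yields better results than keeping it.

03 A simple vocabulary adaptation method is enough to outperform the pre-defined vocabulary.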

Abstract

This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.
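The abstract does not spell out how the new vocabulary is introduced, so the following is only a generic sketch of one common way to swap vocabularies before continual training, not the authors' method. It assumes a heuristic in which tokens shared by both vocabularies keep their pre-trained embedding and tokens that are new start near the mean of the old embeddings. The function name `init_new_embeddings`, the noise scale, and the toy vocabularies are all hypothetical.

```python
import random


def init_new_embeddings(old_vocab, old_emb, new_vocab, seed=0):
    """Build an embedding table for new_vocab from a pre-trained one.

    old_vocab / new_vocab: dict mapping token string -> row index.
    old_emb: list of rows (lists of floats), one row per old token.
    Assumed heuristic (not taken from the paper): copy rows for shared
    tokens, start unseen tokens at the mean old row plus small noise.
    """
    rng = random.Random(seed)
    dim = len(old_emb[0])
    # Mean of the pre-trained embedding rows, one value per dimension.
    mean = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]

    new_emb = [None] * len(new_vocab)
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # Shared token: reuse the pre-trained vector.
            new_emb[new_id] = list(old_emb[old_vocab[token]])
        else:
            # New token: mean vector plus small Gaussian noise.
            new_emb[new_id] = [m + rng.gauss(0.0, 0.01) for m in mean]
    return new_emb


# Toy example: the new vocabulary adds one token and reorders another.
old_vocab = {"<unk>": 0, "the": 1, "cat": 2}
old_emb = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
new_vocab = {"<unk>": 0, "the": 1, "neko": 2, "cat": 3}
new_emb = init_new_embeddings(old_vocab, old_emb, new_vocab)
```

In this toy setup, the rows for "the" and "cat" are carried over unchanged even though "cat" moves to a different index, while "neko" starts close to the mean of the old rows. In a real model the same remapping would also have to be applied to the output (softmax) layer if it is not tied to the input embeddings.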

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling