How does a Language-Specific Tokenizer affect LLMs?

Jean Seo; Jaeyoon Kim; SungJoo Byun; Hyopil Shin

arXiv:2502.12560 · cs.CL · February 24, 2025

TL;DR

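This paper investigates how a language-specific tokenizer affects the behavior and stability of Large Language Models trained predominantly on English text. Using Korean as a case study, it shows that an extended tokenizer lowers the model's confidence in incorrect predictions and reduces cross-entropy in complex tasks, which points to more stable, less nonsensical output.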

Contribution

The paper develops a Korean-specific extended tokenizer and empirically compares models using it against models using the basic tokenizer across several Next Token Prediction tasks.
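To make the idea of an "extended" tokenizer concrete, the following is a toy sketch only, not the paper's method. It assumes a greedy longest-match tokenizer with UTF-8 byte fallback, and the vocabularies `basic_vocab` and `extended_vocab` are hypothetical. It shows how adding Korean pieces to an English-centric vocabulary shortens the token sequence for Korean text.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation with UTF-8 byte fallback (toy model)."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for span in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + span]
            if piece in vocab:
                tokens.append(piece)
                i += span
                break
        else:
            # No vocabulary entry matched: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies, chosen only for illustration.
basic_vocab = {"the", "a", "language"}
extended_vocab = basic_vocab | {"한국", "어"}

text = "한국어"  # "Korean language"
basic_tokens = greedy_tokenize(text, basic_vocab)
extended_tokens = greedy_tokenize(text, extended_vocab)

print(len(basic_tokens), basic_tokens)
print(len(extended_tokens), extended_tokens)
```

In this sketch each Hangul syllable falls back to three byte tokens under the basic vocabulary, while the extended vocabulary covers the same text with two pieces. Real subword tokenizers are trained rather than hand-written, so the numbers here are illustrative only.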

Findings

1. The extended tokenizer decreases the model's confidence in incorrect predictions during generation.

2. It reduces cross-entropy in complex tasks.

3. Together, these effects indicate more stable, less nonsensical outputs.

Abstract

The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability…
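The two quantities named in the abstract, cross-entropy and confidence in incorrect predictions, can be sketched for a Next Token Prediction setting as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, "confidence" is taken to mean the softmax probability of the top-ranked token, and the logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_metrics(logits_per_step, gold_ids):
    """Return (mean cross-entropy, mean confidence on incorrect predictions).

    The second value is None when every prediction is correct.
    """
    ce_sum = 0.0
    wrong_conf = []
    for logits, gold in zip(logits_per_step, gold_ids):
        probs = softmax(logits)
        # Cross-entropy for this step: negative log-probability of the gold token.
        ce_sum += -math.log(probs[gold])
        # Predicted token: the one with the highest probability.
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != gold:
            wrong_conf.append(probs[pred])
    mean_ce = ce_sum / len(gold_ids)
    mean_wrong_conf = sum(wrong_conf) / len(wrong_conf) if wrong_conf else None
    return mean_ce, mean_wrong_conf


# Made-up logits over a 2-token vocabulary; the gold token is index 1,
# but the model prefers index 0, so this step counts as an incorrect prediction.
mean_ce, mean_wrong_conf = next_token_metrics([[2.0, 0.0]], [1])
print(mean_ce, mean_wrong_conf)
```

Under this reading, a lower `mean_wrong_conf` means the model is less sure of itself when it is wrong, and a lower `mean_ce` means it assigns more probability to the reference tokens.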

Taxonomy

Topics: Natural Language Processing Techniques