How does a Language-Specific Tokenizer affect LLMs?
Jean Seo, Jaeyoon Kim, SungJoo Byun, Hyopil Shin

TL;DR
This paper investigates how language-specific tokenizers affect the behavior and stability of Large Language Models, using Korean as a case study. It shows that a tailored tokenizer lowers confidence in incorrect predictions and reduces cross-entropy in complex tasks, which points to more stable, less nonsensical output.
Contribution
The paper introduces a Korean-specific extended tokenizer and empirically compares models using it against models using the basic tokenizer on Next Token Prediction tasks.
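To make the idea of an "extended" tokenizer concrete, here is a minimal toy sketch. It is not the authors' tokenizer: the greedy longest-match rule, the UTF-8 byte fallback, and both vocabularies are assumptions chosen only to illustrate why adding Korean entries to a mostly English vocabulary can shorten the token sequence for Korean text.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (illustrative assumption).

    Any character not covered by the vocabulary falls back to one
    token per UTF-8 byte, similar in spirit to byte-fallback schemes.
    """
    tokens = []
    max_len = max((len(v) for v in vocab), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocabulary entry matched: emit raw byte tokens.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies, invented for this example only.
base_vocab = {"the", "a", " "}
extended_vocab = base_vocab | {"한국", "어"}

text = "한국어"
base_tokens = tokenize(text, base_vocab)          # byte-level fallback
extended_tokens = tokenize(text, extended_vocab)  # Korean-aware pieces

print(len(base_tokens), base_tokens)
print(len(extended_tokens), extended_tokens)
```

In this toy setup the base vocabulary splits the three-syllable word into nine byte tokens, while the extended vocabulary covers it with two pieces. Real subword tokenizers are trained rather than hand-written, so the numbers here are illustrative only.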
Findings
The extended tokenizer reduces confidence in incorrect predictions during generation.
It decreases cross-entropy in complex tasks.
It yields more stable, less nonsensical outputs.
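The two quantities named in these findings can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: the function names, the toy logits, and the choice to average confidence over only the steps where the top prediction is wrong are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_metrics(logits_per_step, gold_ids):
    """Return (mean cross-entropy, mean confidence on wrong predictions).

    Confidence is taken to be the probability of the top-ranked token;
    it is averaged only over steps where that token differs from the
    gold token (an assumed definition for this sketch).
    """
    ce_sum = 0.0
    wrong_conf = []
    for logits, gold in zip(logits_per_step, gold_ids):
        probs = softmax(logits)
        ce_sum += -math.log(probs[gold])
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != gold:
            wrong_conf.append(probs[pred])
    mean_ce = ce_sum / len(gold_ids)
    mean_wrong = sum(wrong_conf) / len(wrong_conf) if wrong_conf else None
    return mean_ce, mean_wrong


# Toy logits over a 3-token vocabulary for two prediction steps.
example_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
example_gold = [0, 2]  # step 1 is predicted correctly, step 2 is not

mean_ce, mean_wrong = next_token_metrics(example_logits, example_gold)
print(f"mean cross-entropy: {mean_ce:.3f}")
print(f"mean confidence on incorrect predictions: {mean_wrong:.3f}")
```

Under these definitions, a model that is wrong but spreads its probability mass more evenly scores a lower confidence on incorrect predictions, which is the direction of the effect reported for the extended tokenizer.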
Abstract
The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability…
Taxonomy
Topics: Natural Language Processing Techniques
