Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Gyeongje Cho; Yeonkyoun So; Chanwoo Park; Sangmin Lee; Sungmok Jung; Jaejin Lee

arXiv:2506.15138·cs.CL·June 19, 2025

TL;DR

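Thunder-Tok is a Korean tokenizer that lowers token fertility (the number of tokens per word) by combining linguistically informed pre-tokenization rules with branching entropy-based token selection. Compared to BPE, it yields roughly 10% fewer tokens and correspondingly faster inference without loss of downstream performance.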

Contribution

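The paper introduces a rule-based pre-tokenization method aligned with the linguistic structure of Korean, paired with a seed vocabulary of tokens that resemble linguistic units and a branching entropy-based selection algorithm. Together these lengthen the average token and so reduce token fertility.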

Findings

01. Reduces token fertility by approximately 10% compared to BPE

02. Improves inference speed by approximately 10%

03. Maintains performance across downstream tasks
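Fertility here is the average number of tokens a tokenizer emits per word. The following is a minimal sketch of how that figure, and a relative reduction between two tokenizers, could be measured. It assumes whitespace-delimited units as the word definition, which may differ from the paper's. The function names and the two toy tokenizers are hypothetical stand-ins and are neither Thunder-Tok nor BPE.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word (assumed word unit)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return n_tokens / n_words


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` fertility is lower than `baseline`."""
    return 1.0 - candidate / baseline


# Hypothetical toy tokenizers, for illustration only.
def char_tokenize(text):
    # One token per non-space character (a high-fertility extreme).
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    # One token per whitespace-delimited word (fertility of exactly 1).
    return text.split()


corpus = ["나는 학교에 갔다", "오늘 날씨가 좋다"]

baseline = fertility(corpus, char_tokenize)
candidate = fertility(corpus, word_tokenize)
print(f"baseline fertility:  {baseline:.3f}")
print(f"candidate fertility: {candidate:.3f}")
print(f"relative reduction:  {relative_reduction(baseline, candidate):.1%}")
```

Under this definition, a reported reduction of about 10% corresponds to `relative_reduction` returning roughly 0.10 when the baseline is a BPE tokenizer evaluated on the same text.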

Abstract

This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for…
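The abstract does not spell out the selection algorithm, so the following is only a generic illustration of branching entropy, not the authors' procedure. For a substring s, the right branching entropy is the entropy of the distribution over the characters that follow s in a corpus. A high value suggests that s ends at a natural boundary, because many different continuations follow it. The function name, the `max_len` parameter, and the end-of-word marker are assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(corpus, max_len=4):
    """Map each substring (up to max_len chars) to the entropy, in bits,
    of the character that follows it within a word.

    Generic sketch: "\0" is an assumed end-of-word marker.
    """
    followers = defaultdict(Counter)
    for text in corpus:
        for word in text.split():
            padded = word + "\0"
            for i in range(len(word)):
                for j in range(i + 1, min(i + max_len, len(word)) + 1):
                    # word[i:j] is the substring; padded[j] is what follows it.
                    followers[word[i:j]][padded[j]] += 1

    entropy = {}
    for sub, counts in followers.items():
        total = sum(counts.values())
        entropy[sub] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


# Toy corpus: the stem "학교" is followed by three different particles.
toy_corpus = ["학교에 학교가 학교는 학생"]
ent = right_branching_entropy(toy_corpus)
for sub in ("학", "학교", "학생"):
    print(sub, round(ent[sub], 3))
```

In this toy corpus the stem "학교" has a higher branching entropy than its prefix "학", which is the kind of signal an entropy-based selector could use to prefer tokens that end at morpheme-like boundaries.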

Taxonomy

Topics: Natural Language Processing Techniques