Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho, Yeonkyoun So, Chanwoo Park, Sangmin Lee, Sungmok Jung, Jaejin Lee

TL;DR
Thunder-Tok is a Korean tokenizer that reduces token fertility by using linguistically informed rules and entropy-based selection, leading to faster inference without performance loss.
Contribution
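The paper introduces a rule-based pre-tokenization method aligned with the linguistic structure of Korean, combined with a seed vocabulary of linguistically motivated tokens and branching entropy-based token selection, to reduce token fertility and improve efficiency.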
Findings
Reduces token fertility by approximately 10% compared to BPE
Improves inference speed by approximately 10%
Maintains performance across various downstream tasks
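To make the fertility figure concrete, the following is a minimal sketch of how tokens per word and a relative reduction could be computed. It assumes that a "word" is a whitespace-delimited unit, which this summary does not specify; the function names and the two toy tokenizers are hypothetical and are not Thunder-Tok or BPE.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` fertility is lower than `baseline`."""
    return (baseline - candidate) / baseline


def char_tokenize(text):
    # Toy tokenizer: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    # Toy tokenizer: one token per whitespace-delimited word.
    return text.split()


sample = ["ab cd"]
char_fertility = fertility(sample, char_tokenize)   # 4 tokens / 2 words
word_fertility = fertility(sample, word_tokenize)   # 2 tokens / 2 words
```

Under this definition, a tokenizer with 10% lower fertility emits 10% fewer tokens for the same text, which is the quantity the findings above refer to.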
Abstract
This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for…
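The abstract mentions branching entropy-based selection without giving details in this excerpt. As an illustration only, the sketch below computes the classical right branching entropy of character prefixes: the entropy of the distribution over the next character following a given string. It is not the authors' selection algorithm; the function name, the `max_len` parameter, and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus, max_len=3):
    """Map each prefix (up to max_len chars) to the entropy of its next character."""
    followers = defaultdict(Counter)
    for text in corpus:
        for i in range(len(text)):
            for n in range(1, max_len + 1):
                j = i + n
                if j < len(text):
                    # Count which character follows the substring text[i:j].
                    followers[text[i:j]][text[j]] += 1

    entropy = {}
    for prefix, counts in followers.items():
        total = sum(counts.values())
        entropy[prefix] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


# Toy corpus: the noun "학교" (school) followed by three different particles.
corpus = ["학교에", "학교는", "학교를"]
H = branching_entropy(corpus)
```

In this toy corpus "학" is always followed by "교", so its branching entropy is zero, while "학교" is followed by three different particles, so its entropy is log2(3). High entropy after a string suggests a boundary between linguistic units, which is the intuition behind using this quantity to choose tokens.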
Taxonomy
Topics: Natural Language Processing Techniques
