Tokenization Is More Than Compression
Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

TL;DR
This paper investigates the factors behind effective tokenization in NLP, challenging the idea that fewer tokens always improve performance, and highlights the importance of pre-tokenization and BPE initialization.
Contribution
The paper introduces PathPiece, a tokenizer that segments text into the minimum number of tokens for a given vocabulary, and provides a comprehensive analysis of the tokenization design choices that affect language model performance.
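Segmenting text into the fewest tokens for a fixed vocabulary can be framed as a shortest-path problem over character positions. The sketch below illustrates that idea with a simple dynamic program. It is not the authors' implementation: the function name `min_token_segmentation`, the `max_len` parameter, the use of characters rather than bytes, and the tie-breaking behavior are all assumptions made for illustration.

```python
def min_token_segmentation(text, vocab, max_len=None):
    """Split text into the fewest tokens drawn from vocab.

    Hypothetical helper: a dynamic program over end positions, where
    best[i] is the minimum number of tokens that exactly cover text[:i].
    """
    if max_len is None:
        # Assumption: no token is longer than the longest vocabulary entry.
        max_len = max((len(t) for t in vocab), default=0)
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)  # fewest tokens covering text[:i]
    back = [0] * (n + 1)    # start index of the last token in that cover
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and text[start:end] in vocab:
                best[end] = best[start] + 1
                back[end] = start
    if best[n] == inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Toy vocabulary: all single characters plus two longer tokens.
vocab = set("abcd") | {"ab", "bcd"}
print(min_token_segmentation("abcd", vocab))  # ['a', 'bcd']
```

In this toy case a left-to-right longest-match segmenter would produce `ab`, `c`, `d` (three tokens), while the dynamic program finds the two-token split. Including every single character in the vocabulary guarantees that some segmentation always exists.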
Findings
Fewer tokens do not necessarily lead to better downstream performance.
Pre-tokenization significantly impacts tokenizer effectiveness.
Using BPE to initialize vocabulary offers notable benefits.
Abstract
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer tokens lead to better downstream performance by introducing PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary. Through extensive experimentation we find this hypothesis not to be the case, casting doubt on the understanding of the reasons for effective tokenization. To examine which other factors play a role, we evaluate design decisions across all three phases of tokenization: pre-tokenization, vocabulary construction, and…
Taxonomy
Topics: Embedded Systems Design Techniques
Methods: Byte Pair Encoding
