Tokenisation over Bounded Alphabets is Hard
Violeta Kastreva, Philip Whittington, Dennis Komm, Tiago Pimentel

TL;DR
This paper proves that tokenisation over fixed-size alphabets, including binary and unary, is NP-complete, explaining the computational difficulty behind practical tokenisation algorithms like BPE.
Contribution
It establishes the NP-completeness of tokenisation over bounded alphabets, including binary and unary, highlighting fundamental computational barriers.
Findings
NP-completeness of tokenisation over binary alphabets
No polynomial-time approximation scheme exists for either variant over binary alphabets, unless P=NP
NP-completeness also holds for unary alphabets
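For reference, the standard notion behind the second finding can be written out. This sketch assumes the objective is a minimisation, for example the number of tokens after compression. The symbols $A_\varepsilon$, $\mathrm{cost}$ and $\mathrm{OPT}$ are notation introduced here for illustration, not taken from the paper. Under that assumption, a polynomial-time approximation scheme is a family of algorithms satisfying:

```latex
\forall \varepsilon > 0:\quad
\mathrm{cost}\bigl(A_\varepsilon(x)\bigr) \le (1+\varepsilon)\,\mathrm{OPT}(x)
\quad \text{for every instance } x,
\qquad \text{where } A_\varepsilon \text{ runs in time polynomial in } |x| .
```

The finding says that no such family exists for these problems unless P=NP.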
Abstract
Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets, an unrealistic assumption given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation…
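The two variants named in the abstract can be illustrated with a brute-force sketch on a tiny binary dataset. This is not the paper's formal definition. It assumes that compression is measured as the total number of tokens, that candidate tokens are substrings of the dataset, and that merges are applied left to right as in BPE. All function names (`min_tokens`, `direct_tokenisation`, `bottom_up_tokenisation`) are hypothetical. Both searches are exhaustive and therefore exponential, which is consistent with, though of course not a proof of, the hardness results.

```python
from itertools import combinations


def min_tokens(text, vocab):
    """Fewest tokens needed to segment `text` with tokens from `vocab` (DP)."""
    inf = float("inf")
    best = [0] + [inf] * len(text)
    for end in range(1, len(text) + 1):
        for tok in vocab:
            start = end - len(tok)
            if start >= 0 and text[start:end] == tok:
                best[end] = min(best[end], best[start] + 1)
    return best[len(text)]


def direct_tokenisation(dataset, alphabet, k):
    """Assumed direct variant: try every set of k extra tokens."""
    # Candidate tokens: all substrings of length >= 2 occurring in the dataset.
    candidates = sorted(
        {s[i:j] for s in dataset
         for i in range(len(s)) for j in range(i + 2, len(s) + 1)}
    )
    best_cost, best_extra = None, None
    for extra in combinations(candidates, k):
        vocab = set(alphabet) | set(extra)
        cost = sum(min_tokens(s, vocab) for s in dataset)
        if best_cost is None or cost < best_cost:
            best_cost, best_extra = cost, extra
    return best_cost, best_extra


def apply_merge(seq, pair):
    """Replace adjacent occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bottom_up_tokenisation(dataset, k):
    """Assumed bottom-up variant: try every sequence of up to k merges."""
    def search(seqs, remaining):
        cost = sum(len(s) for s in seqs)
        if remaining == 0:
            return cost, []
        pairs = sorted(
            {(s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1)}
        )
        best = (cost, [])
        for pair in pairs:
            merged = [apply_merge(s, pair) for s in seqs]
            sub_cost, sub_merges = search(merged, remaining - 1)
            if sub_cost < best[0]:
                best = (sub_cost, [pair] + sub_merges)
        return best

    return search([list(s) for s in dataset], k)


if __name__ == "__main__":
    data = ["0101", "0101"]
    print(direct_tokenisation(data, "01", 1))
    print(bottom_up_tokenisation(data, 1))
    print(bottom_up_tokenisation(data, 2))
```

On this toy dataset the two variants differ for the same budget. One directly chosen token can cover a whole string, whereas a single merge can only join two adjacent symbols, so the bottom-up variant needs two merges to reach the same compression.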
Peer Reviews
Decision · ICLR 2026 Poster
The main strength of the paper is that it closes a gap in earlier NP-completeness results, which assumed alphabets of unbounded size. The current paper shows that these hardness results hold even for small alphabets.
The scope of the paper may be more suitable for a conference on computational complexity. On the other hand, the results concern an important problem in natural language processing. It may therefore fit a section dedicated to computational complexity results within natural language processing.
The computational problem studied in this paper is highly relevant, and, given the prior work, the demonstrated impossibility of approximation algorithms with arbitrary precision represents a significant theoretical contribution. The results provide valuable insights into the computational complexity of a fundamental step in modern AI and NLP models. The paper is clearly structured and written fairly well. While I did not verify all proofs in detail (as they are presented in the appendix), the c
While the paper is theoretically interesting, practical tokenizers appear to work very well, and it is not clear whether these results will have any impact on the progress of modern NLP models. Another weakness is that the definitions of the computational problems appear more complicated than they need to be. Some of the notation appears non-standard. For example, while $tok$ is a function by definition, they also use it as a set and use notations such as $|tok|$ which, to my understanding is the
The paper studies a practically motivated problem which is a key step in training natural language processing models. The result is strong and essentially resolves the question of tokenization with compression as an objective. The paper is well written, and the proofs are well motivated and easy to follow.
No clear weaknesses (see minor comments below)
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Natural Language Processing Techniques
