Tokenisation over Bounded Alphabets is Hard

Violeta Kastreva; Philip Whittington; Dennis Komm; Tiago Pimentel

arXiv:2511.15709 · cs.CL · November 20, 2025

Open Access · 3 Reviews

TL;DR

This paper proves that tokenisation over fixed-size alphabets, including binary and unary, is NP-complete, explaining the computational difficulty behind practical tokenisation algorithms like BPE.

Contribution

It establishes that tokenisation remains NP-complete over bounded alphabets, including binary and unary ones, closing the gap left by earlier hardness results that assumed unboundedly large alphabets.

Findings

01

NP-completeness of tokenisation over binary alphabets

02

No polynomial-time approximation scheme exists for either variant, even over binary alphabets (unless P=NP)

03

NP-completeness also holds for unary alphabets
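To make the unary setting concrete, here is a minimal sketch. It assumes direct tokenisation means choosing `k` extra tokens, on top of the single base symbol, so that the dataset can be written with as few tokens as possible. Under that assumption a unary string is fully described by its length, and encoding it resembles a change-making problem. The function names and the exhaustive search are illustrative only and are not taken from the paper.

```python
from itertools import combinations

def min_tokens(n, lengths):
    """Fewest tokens needed to spell a unary string of length n.

    `lengths` holds the available token lengths and is assumed to
    contain 1 (the base symbol), so every string can be encoded.
    """
    best = [0] + [n + 1] * n
    for m in range(1, n + 1):
        best[m] = min(best[m - l] + 1 for l in lengths if l <= m)
    return best[n]

def best_unary_vocab(string_lengths, k):
    """Exhaustively try every choice of k extra token lengths.

    Returns (total token count, chosen lengths). The number of
    candidate vocabularies grows combinatorially with k.
    """
    longest = max(string_lengths)
    best = None
    for extra in combinations(range(2, longest + 1), k):
        lengths = (1,) + extra
        cost = sum(min_tokens(n, lengths) for n in string_lengths)
        if best is None or cost < best[0]:
            best = (cost, extra)
    return best
```

For example, `best_unary_vocab([6, 9], 1)` compares every single extra token length between 2 and 9. The search is exact but brute-force, which is the kind of cost a hardness result suggests cannot be avoided in general.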

Abstract

Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets, an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation…
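The bottom-up variant described in the abstract can be illustrated with a small sketch over a binary alphabet. It assumes a merge replaces left-to-right, non-overlapping occurrences of an adjacent token pair, as in BPE-style tokenisers; the paper's formal definition may differ. The helper names (`apply_merge`, `compressed_length`, `best_merges`) are hypothetical, and the search is exhaustive, so it is only usable on toy inputs.

```python
def apply_merge(seq, pair):
    """Replace left-to-right, non-overlapping occurrences of `pair`
    in the token list `seq` with a single merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compressed_length(dataset, merges):
    """Total number of tokens after applying `merges` in order."""
    total = 0
    for s in dataset:
        seq = list(s)
        for pair in merges:
            seq = apply_merge(seq, pair)
        total += len(seq)
    return total

def best_merges(seqs, k):
    """Exhaustive search over merge sequences of length at most k.

    Returns (total token count, merge list). The running time is
    exponential in k, so this is a sketch, not a practical method.
    """
    best = (sum(len(s) for s in seqs), [])
    if k == 0:
        return best
    pairs = {p for s in seqs for p in zip(s, s[1:])}
    for pair in sorted(pairs):
        merged = [apply_merge(s, pair) for s in seqs]
        length, rest = best_merges(merged, k - 1)
        if length < best[0]:
            best = (length, [pair] + rest)
    return best
```

Calling `best_merges([list("010101")], 2)` searches every sequence of up to two merges on a single binary string. Practical tokenisers replace this search with a greedy choice at each step.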

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The main strength of the paper is that it closes a gap in earlier NP-completeness results, which assumed alphabets of unbounded size. The current paper shows that these hardness results hold even for small alphabets.

Weaknesses

The scope of the paper may be more suitable for a conference on computational complexity. On the other hand, the results concern an important problem in natural language processing, so the paper may fit a section dedicated to computational complexity results within natural language processing.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The computational problem studied in this paper is highly relevant, and, given the prior work, the demonstrated impossibility of approximation algorithms with arbitrary precision represents a significant theoretical contribution. The results provide valuable insights into the computational complexity of a fundamental step in modern AI and NLP models. The paper is clearly structured and written fairly well. While I did not verify all proofs in detail (as they are presented in the appendix), the c

Weaknesses

While this is an interesting paper theoretically, practical tokenizers appear to work very well, and it is not clear whether these results will have any impact on the progress of modern NLP models. Another weakness is that the definitions of the computational problems appear more complicated than they need to be. Some notation appears non-standard. For example, while $tok$ is a function by definition, they also use it as a set and use notations such as $|tok|$ which, to my understanding, is the

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper studies a practically motivated problem which is a key step in training natural language processing models. The result is strong and essentially resolves the question of tokenization with compression as an objective. The paper is well written, and the proofs are well motivated and easy to follow.

Weaknesses

No clear weaknesses (see minor comments below)

Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Natural Language Processing Techniques