A Partition Cover Approach to Tokenization

Jia Peng Lim; Shawn Tan; Davin Choo; Hady W. Lauw

arXiv:2501.06246 · cs.CL · September 30, 2025

TL;DR

This paper formulates tokenization as an optimization problem and introduces GreedTok, a polynomial-time greedy algorithm that outperforms BPE in compression and yields better language model performance.

Contribution

The paper formulates tokenization as an optimization problem, shows that the problem is NP-hard via a reduction from vertex cover, proposes a polynomial-time greedy algorithm (GreedTok), and demonstrates its advantage over BPE through empirical and pre-training evaluations.

Findings

01 GreedTok outperforms BPE and Unigram in compression.

02 GreedTok achieves a covering score comparable to GreedWMC.

03 Models pre-trained with GreedTok achieve lower bits per byte than their BPE counterparts.
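
Bits per byte normalizes a model's loss by the byte length of the text rather than by its token count, which makes models trained with different tokenizers comparable. The following is a minimal sketch, assuming the common definition (summed cross-entropy in nats, converted to bits, divided by the UTF-8 byte count). The paper's exact evaluation protocol is not shown on this page, and the function name `bits_per_byte` and the numbers are illustrative assumptions.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    # Dividing by ln(2) converts nats to bits; dividing by n_bytes normalizes per byte.
    return total_nll_nats / (math.log(2) * n_bytes)


# Hypothetical numbers: a summed loss of 8 * ln(2) nats (8 bits) over a 4-byte string.
example_bpb = bits_per_byte(8 * math.log(2), "abcd")
```

Because the denominator counts bytes, a tokenizer that produces fewer tokens for the same text does not get an automatic advantage or penalty from the normalization itself.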

Abstract

Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which has a simple $(1 - 1/e)$-approximation algorithm, GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training…
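
The weighted maximum coverage relaxation mentioned in the abstract admits a simple greedy strategy: repeatedly pick the set that adds the most not-yet-covered weight. The following is a minimal sketch of that generic strategy, not the authors' GreedWMC or GreedTok implementation. The function name `greedy_wmc`, the toy sets, and the weights are all illustrative assumptions.

```python
from typing import Dict, Hashable, List, Set


def greedy_wmc(
    sets: Dict[str, Set[Hashable]],
    weights: Dict[Hashable, float],
    k: int,
) -> List[str]:
    """Pick at most k sets, greedily maximizing the total weight of covered elements."""
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    for _ in range(k):
        best_name, best_gain = None, 0.0
        for name, elems in sets.items():
            if name in chosen:
                continue
            # Marginal gain: weight of elements this set would newly cover.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # No remaining set adds any weight.
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen


# Toy instance (hypothetical data): element 4 carries most of the weight.
toy_sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}}
toy_weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 5.0, 5: 0.5}
picked = greedy_wmc(toy_sets, toy_weights, k=2)
```

On this toy instance the first pick is `"B"` (gain 6.0), after which `"A"` adds more new weight (2.0) than `"C"` (0.5). In a tokenization setting the sets would correspond to candidate tokens and the elements to positions in the corpus that a token can cover, but that mapping is only sketched here and is not taken from the paper's code.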

Code & Models

Repositories

preferredai/pcatt
jax · Official

Videos

A Partition Cover Approach to Tokenization · SlidesLive

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Algorithms and Data Compression

Methods: Byte Pair Encoding