Tokenisation via Convex Relaxations

Jan Tempus; Philip Whittington; Craig W. Schmidt; Dennis Komm; Tiago Pimentel

arXiv:2605.22821·cs.CL·May 22, 2026

TL;DR

This paper introduces ConvexTok, a tokenisation algorithm based on convex optimisation. It improves intrinsic tokenisation metrics and the bits-per-byte achieved by language models, and it provides a certifiable bound on how far the tokeniser is from optimal.

Contribution

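The paper formulates tokeniser construction as a linear program and solves it with convex optimisation tools. This contrasts with greedy algorithms such as BPE and Unigram, which make locally optimal decisions without considering the resulting vocabulary as a whole.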

Findings

01

ConvexTok improves intrinsic tokenisation metrics.

02

It consistently improves the bits-per-byte (BpB) achieved by language models; downstream task performance also improves, but less consistently.

03

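ConvexTok provides a lower bound that certifies how far a tokeniser is from optimal for a given objective; empirically it is within 1% of optimal at common vocabulary sizes.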
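
For readers unfamiliar with the bits-per-byte metric named in the second finding, the following is a minimal sketch of how BpB is commonly computed from a language model's per-token losses. The function name `bits_per_byte` and the assumption that losses are given in nats are illustrative choices, not taken from the paper.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Total negative log-likelihood, converted to bits, per UTF-8 byte of text.

    token_nll_nats: per-token negative log-likelihoods in nats (assumed unit).
    text: the string those tokens encode.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    # Convert nats to bits by dividing by ln(2).
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes


# Hypothetical example: four tokens, each costing exactly one bit,
# covering a four-byte string.
example_bpb = bits_per_byte([math.log(2)] * 4, "abcd")
```

Because the denominator counts bytes rather than tokens, BpB can be compared across models that use different tokenisers, which is why it suits a tokeniser comparison.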

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms: they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.
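
The certification claim rests on a standard argument about lower bounds. The sketch below assumes the objective is minimised (for example, a token count) and is strictly positive; the function name `certified_gap` and the numbers are hypothetical and are not results from the paper. If a lower bound LB satisfies LB ≤ OPT ≤ achieved, then the true relative gap (achieved − OPT) / OPT can be no larger than (achieved − LB) / LB, so the second quantity is a certificate that needs no knowledge of OPT.

```python
def certified_gap(achieved, lower_bound):
    """Upper bound on the relative distance from optimal for a minimised objective.

    Valid whenever 0 < lower_bound <= OPT <= achieved.
    """
    if lower_bound <= 0:
        raise ValueError("lower_bound must be positive")
    if achieved < lower_bound:
        raise ValueError("achieved value cannot be below a valid lower bound")
    return (achieved - lower_bound) / lower_bound


# Hypothetical numbers: a tokeniser yields 1005 tokens on a corpus and the
# relaxation certifies that no tokeniser can use fewer than 1000.
gap = certified_gap(1005, 1000)  # at most 0.5% above optimal
```

A gap below 0.01 under this definition corresponds to a "within 1% of optimal" guarantee.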
