Theoretical Analysis of Byte-Pair Encoding
László Kozma, Johannes Voderholzer

TL;DR
This paper provides a theoretical analysis of Byte-Pair Encoding (BPE). It shows that the underlying optimization problem is computationally hard (APX-complete), yet BPE still approximates the optimal compression utility within a constant factor in the worst case, which helps explain its practical success.
Contribution
It proves the APX-completeness of the BPE optimization problem and establishes worst-case approximation bounds for BPE's compression utility.
Findings
BPE's optimization problem is APX-complete.
BPE's worst-case approximation factor for the compression utility of the optimal pair encoding lies between 0.333 and 0.625.
First rigorous guarantees on BPE's compression utility for all inputs.
Abstract
Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM) pretraining, to create a token dictionary of a prescribed size. Most evaluations of BPE to date are empirical, and the reasons for its good practical performance are not well understood. In this paper we focus on the optimization problem underlying BPE: finding a pair encoding that achieves optimal compression utility. We show that this problem is APX-complete, indicating that it is unlikely to admit a polynomial-time approximation scheme. This answers, in a stronger form, a question recently raised by Zouhar et al. On the positive side, we show that BPE approximates the compression utility of the optimal pair encoding to a worst-case factor between…
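To make the objects in the abstract concrete, the following is a minimal sketch of greedy BPE and of compression utility, taken here to mean the reduction in sequence length after a given number of merges. The function names (`merge_pair`, `greedy_bpe`, `compression_utility`), the naive counting of adjacent pairs, the left-to-right replacement, and the first-seen tie-breaking are assumptions made for illustration; the paper's formal definitions may differ in these details.

```python
from collections import Counter


def merge_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair` in `seq`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # skip both symbols of the merged pair
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, k):
    """Apply up to k greedy merges; return the final token sequence and the merges used.

    Assumption: pair frequency is counted naively over adjacent positions,
    and ties go to the pair seen first.
    """
    seq = list(text)
    merges = []
    for _ in range(k):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:  # fewer than two symbols left, nothing to merge
            break
        pair, _freq = counts.most_common(1)[0]
        seq = merge_pair(seq, pair, pair[0] + pair[1])
        merges.append(pair)
    return seq, merges


def compression_utility(text, k):
    """Length reduction achieved by greedy BPE with at most k merges."""
    seq, _merges = greedy_bpe(text, k)
    return len(text) - len(seq)


if __name__ == "__main__":
    tokens, merges = greedy_bpe("abababab", 2)
    print(tokens, merges, compression_utility("abababab", 2))
```

The hardness result concerns choosing the sequence of merges that maximizes this utility over all possible pair encodings; the sketch only implements the greedy choice, which is the strategy whose worst-case quality the paper bounds.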
Taxonomy
Topics: DNA and Biological Computing · Cellular Automata and Applications · Algorithms and Data Compression
Methods: Byte Pair Encoding · Focus
