Theoretical Analysis of Byte-Pair Encoding

László Kozma; Johannes Voderholzer

arXiv:2411.08671 · cs.DS · November 14, 2024 · 2 cites

TL;DR

This paper analyzes Byte-Pair Encoding (BPE) theoretically. It shows that the optimization problem underlying BPE is APX-complete, yet BPE still approximates the compression utility of the optimal pair encoding to a worst-case factor between 0.333 and 0.625, which helps explain its practical success.

Contribution

It proves the APX-completeness of the BPE optimization problem and establishes worst-case approximation bounds for BPE's compression utility.

Findings

01. The optimization problem underlying BPE is APX-complete.

02. BPE approximates the compression utility of the optimal pair encoding to a worst-case factor between 0.333 and 0.625.

03. These are the first rigorous guarantees on BPE's compression utility that hold for all inputs.
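
One way to read the second finding is as a pair of bounds on a worst-case ratio. The notation below is an assumption made for illustration and is not taken from the paper: $u_{\mathrm{BPE}}(s,k)$ and $u_{\mathrm{OPT}}(s,k)$ stand for the compression utility reached on input $s$ with $k$ merges by BPE and by an optimal pair encoding, respectively.

```latex
\[
  \rho \;=\; \inf_{s,\,k} \frac{u_{\mathrm{BPE}}(s,k)}{u_{\mathrm{OPT}}(s,k)},
  \qquad
  0.333 \;\le\; \rho \;\le\; 0.625 .
\]
```

Under this reading, the lower bound says that BPE attains at least a 0.333 fraction of the optimal utility on every input, and the upper bound says that on some inputs it attains no more than roughly a 0.625 fraction.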

Abstract

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM) pretraining, to create a token dictionary of a prescribed size. Most evaluations of BPE to date are empirical, and the reasons for its good practical performance are not well understood. In this paper we focus on the optimization problem underlying BPE: finding a pair encoding that achieves optimal compression utility. We show that this problem is APX-complete, indicating that it is unlikely to admit a polynomial-time approximation scheme. This answers, in a stronger form, a question recently raised by Zouhar et al. On the positive side, we show that BPE approximates the compression utility of the optimal pair encoding to a worst-case factor between…
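
The greedy procedure and the notion of compression utility mentioned in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's formal definition: the function names are hypothetical, pair frequency is counted as non-overlapping left-to-right occurrences, ties are broken lexicographically, and merged symbols are represented by string concatenation.

```python
def count_nonoverlapping(tokens, pair):
    """Count occurrences of `pair` that a left-to-right merge would replace."""
    count = 0
    i = 0
    while i + 1 < len(tokens):
        if (tokens[i], tokens[i + 1]) == pair:
            count += 1
            i += 2  # skip both symbols so occurrences never overlap
        else:
            i += 1
    return count


def merge_pair(tokens, pair, new_symbol):
    """Replace occurrences of `pair` with `new_symbol`, scanning left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Apply up to `num_merges` greedy merges; return (tokens, merges)."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        candidates = set(zip(tokens, tokens[1:]))
        if not candidates:
            break  # fewer than two symbols left, nothing to merge
        # Assumption: ties go to the lexicographically smallest pair.
        best = max(sorted(candidates),
                   key=lambda p: count_nonoverlapping(tokens, p))
        # Simplification: the new symbol is the concatenation of its parts.
        tokens = merge_pair(tokens, best, best[0] + best[1])
        merges.append(best)
    return tokens, merges


def compression_utility(text, tokens):
    """Number of symbols saved by the encoding."""
    return len(text) - len(tokens)


if __name__ == "__main__":
    tokens, merges = greedy_bpe("abcabcabc", 2)
    print(tokens, merges, compression_utility("abcabcabc", tokens))
```

On the input "abcabcabc" with two merges, this sketch first merges ("a", "b") and then ("ab", "c"), leaving three tokens and a utility of 6. The optimization problem studied in the paper asks for the merge sequence that maximizes this utility, which the greedy choice does not always find.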

Taxonomy

Topics: DNA and Biological Computing · Cellular Automata and Applications · Algorithms and Data Compression

Methods: Byte Pair Encoding · Focus