A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar; Clara Meister; Juan Luis Gastaldi; Li Du; Tim Vieira; Mrinmaya Sachan; Ryan Cotterell

arXiv:2306.16837·cs.CL·September 4, 2024

TL;DR

This paper formalizes Byte-Pair Encoding (BPE) as a combinatorial optimization problem, proves an approximation guarantee for the greedy algorithm, and introduces a faster greedy implementation along with an optimized brute-force algorithm for finding optimal merge sequences.

Contribution

The paper formalizes BPE as a combinatorial optimization problem, proves an approximation bound for greedy BPE via submodular functions, and develops both a faster greedy implementation and an optimized brute-force algorithm for computing optimal merge sequences.

Findings

01

Empirically, the lower bound on greedy BPE's approximation ratio is approximately 0.37.

02

A faster implementation reduces the runtime from O(NM) to O(N log M), where N is the sequence length and M is the number of merges.

03

An optimized brute-force algorithm with memoization computes optimal BPE merge sequences.
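
To make the second and third findings concrete, here is a minimal Python sketch under stated assumptions; it is not the paper's implementation. `greedy_bpe` is the naive baseline: each of the M merges rescans the full length-N sequence, which is where the O(NM) cost comes from. `optimal_length` is a toy exhaustive search memoized on the current token sequence and the number of merges left. All function names are hypothetical. Tokens are represented as plain concatenated strings, which is a simplification, and "best" is taken to mean the shortest resulting token sequence.

```python
from collections import Counter
from functools import lru_cache


def apply_merge(tokens, pair):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return tuple(merged)


def greedy_bpe(text, num_merges):
    """Naive greedy BPE: every merge step costs one O(N) pass, O(NM) in total."""
    tokens = tuple(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # ties go to the first pair seen
        merges.append(pair)
        tokens = apply_merge(tokens, pair)
    return merges, tokens


def optimal_length(text, num_merges):
    """Shortest token count reachable with `num_merges` merges (toy brute force)."""

    @lru_cache(maxsize=None)
    def best(tokens, remaining):
        if remaining == 0 or len(tokens) < 2:
            return len(tokens)
        candidates = set(zip(tokens, tokens[1:]))
        return min(best(apply_merge(tokens, p), remaining - 1) for p in candidates)

    return best(tuple(text), num_merges)


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    merges, tokens = greedy_bpe(text, 3)
    print("greedy merges:", merges)
    print("greedy length:", len(tokens))
    print("optimal length:", optimal_length(text, 3))
```

The exhaustive search is only practical for very short inputs and small merge counts; it is included to show what memoization saves (repeated token states are solved once), not as a substitute for the algorithm described in the paper.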

Abstract

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\mu^{\star})}\left(1 - e^{-\sigma(\mu^{\star})}\right)$-approximation of an optimal merge sequence, where $\sigma(\mu^{\star})$ is the total backward curvature with respect to the optimal merge sequence $\mu^{\star}$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(NM)$ to $\mathcal{O}\left(N \log…
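
As a small numerical aid for the approximation factor in the abstract, the sketch below evaluates (1/σ)(1 − e^{−σ}) for a few curvature values. The function name `greedy_bound` and the sample values of σ are assumptions made for illustration; the text above does not state the curvature itself. The value σ = 2.5 is included only because it yields a bound close to the quoted 0.37.

```python
import math


def greedy_bound(sigma):
    """Evaluate (1 / sigma) * (1 - exp(-sigma)) for a positive curvature sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (1.0 - math.exp(-sigma)) / sigma


if __name__ == "__main__":
    # Illustrative curvature values only; they are not taken from the source.
    for sigma in (0.5, 1.0, 2.5):
        print(f"sigma = {sigma:.1f} -> bound = {greedy_bound(sigma):.3f}")
```

The bound decreases as σ grows, and at σ = 1 it reduces to the familiar 1 − 1/e guarantee for greedy maximization of submodular functions.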

Code & Models

Repositories

zouharvi/formal-bpe
Official

Taxonomy

Methods: Byte Pair Encoding