TL;DR
This paper formalizes Byte-Pair Encoding (BPE) as a combinatorial optimization problem, proves an approximation guarantee for the greedy algorithm, and introduces both a faster BPE implementation and a memoized brute-force algorithm for computing optimal merge sequences.
Contribution
The paper formalizes BPE as a combinatorial optimization problem, proves an approximation bound for greedy BPE, and develops a faster implementation as well as an algorithm for computing optimal BPE.
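A rough sketch of what such a formulation can look like is given below. The symbols are notation assumed here for illustration and may differ from the paper's: $x$ is the input sequence, $\mu$ is a sequence of $M$ merges, $\mathrm{apply}_{\mu}(x)$ is the sequence obtained by applying those merges in order, and $\kappa_x$ measures how many tokens the merges save.

```latex
\kappa_x(\mu) = |x| - |\mathrm{apply}_{\mu}(x)|,
\qquad
\mu^\star = \operatorname*{argmax}_{\mu \,:\, |\mu| = M} \kappa_x(\mu)
```

Under this reading, greedy BPE builds $\mu$ one merge at a time, always choosing the merge with the largest immediate gain in $\kappa_x$, while an optimal algorithm searches over whole merge sequences.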
Findings
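Greedy BPE has an empirical approximation ratio of at least about 0.37.
The faster implementation reduces the runtime from O(NM) to O(N log M), where N is the sequence length and M is the number of merges.
Optimal BPE is computed by a brute-force algorithm optimized with memoization.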
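To make the greedy procedure behind these findings concrete, here is a minimal sketch of the naive baseline, assuming character-level input and a tie-breaking rule (first pair seen wins) chosen only for illustration. The names `apply_merge` and `greedy_bpe` are hypothetical. This is the slow variant that rescans the whole sequence for every merge, roughly the O(NM) behavior, and not the paper's faster implementation.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace non-overlapping occurrences of `pair`, left to right, with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # merged token is the concatenation
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Naive greedy BPE: each step merges the currently most frequent adjacent pair."""
    seq = list(text)  # assumption: start from single characters
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))  # one full pass over the sequence
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        # Assumption: ties go to the pair encountered first.
        pair = max(counts, key=counts.get)
        merges.append(pair)
        seq = apply_merge(seq, pair)  # another full pass
    return merges, seq


if __name__ == "__main__":
    merges, seq = greedy_bpe("abababcd", 2)
    print(merges)  # [('a', 'b'), ('ab', 'ab')]
    print(seq)     # ['abab', 'ab', 'c', 'd']
```

Each of the M iterations makes full passes over a sequence of length up to N, which is where the O(NM) cost comes from. Faster variants typically avoid recounting all pairs after every merge.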
Abstract
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\mu^\star)}\left(1 - e^{-\sigma(\mu^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\mu^\star)$ is the total backward curvature with respect to the optimal merge sequence $\mu^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(NM)$ to $\mathcal{O}\left(N \log…
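The idea of searching for an optimal merge sequence by brute force, with memoization to avoid repeated work, can be illustrated with a small sketch. This is not the paper's algorithm: the function names are hypothetical, and caching on the pair (current token sequence, merges remaining) is an assumption made here for simplicity. It is only practical for very short strings.

```python
from functools import lru_cache


def apply_merge(seq, pair):
    """Replace non-overlapping occurrences of `pair`, left to right, with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)  # tuples are hashable, so states can be cached


def optimal_bpe(text, num_merges):
    """Exhaustively search merge sequences; return (shortest length, merges)."""

    @lru_cache(maxsize=None)
    def best(seq, remaining):
        # Base case: no merges left, the sequence stays as it is.
        if remaining == 0:
            return len(seq), ()
        result = (len(seq), ())
        # Only pairs present in the sequence can shorten it; dict.fromkeys
        # removes duplicates while keeping first-seen order.
        for pair in dict.fromkeys(zip(seq, seq[1:])):
            length, rest = best(apply_merge(seq, pair), remaining - 1)
            if length < result[0]:
                result = (length, (pair,) + rest)
        return result

    return best(tuple(text), num_merges)


if __name__ == "__main__":
    length, merges = optimal_bpe("abababcd", 2)
    print(length)  # 4
    print(merges)  # (('a', 'b'), ('ab', 'ab'))
```

Different merge orders often lead to the same intermediate token sequence, so caching by state lets the search reuse results instead of recomputing them. On this toy input the greedy choice happens to be optimal; the approximation bound in the abstract concerns how far greedy can fall short in general.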
Taxonomy
Methods: Byte Pair Encoding
