Faster Superword Tokenization
Craig W. Schmidt, Chris Tanner, and Yuval Pinter

TL;DR
This paper introduces a significantly faster training method for superword tokenization algorithms such as BoundlessBPE and SuperBPE, making them practical to train, and provides open-source implementations.
Contribution
The paper presents a two-phase training approach that speeds up the training of superword tokenization algorithms by over 600 times while producing identical results.
Findings
Training time reduced from 4.7 CPU days to under 10 minutes.
Achieved over 600x speedup in training superword tokenization algorithms.
Provided open-source Python and Rust implementations for practical use.
Abstract
Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates…
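The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pre-tokenization regex, the `is_word` eligibility rule, and the `aggregate` function are hypothetical simplifications chosen for illustration. The sketch counts each pretoken and each run of two or more consecutive eligible pretokens, so that only the frequency tables, not the documents, need to stay in memory.

```python
import re
from collections import Counter

# Simplified pre-tokenizer (an assumption, not the paper's regex):
# a word with an optional leading space, or a single non-space symbol.
PRETOKEN_RE = re.compile(r" ?[A-Za-z]+|[^\sA-Za-z]")


def is_word(pretoken: str) -> bool:
    """Hypothetical eligibility rule: only alphabetic pretokens may join a superword."""
    return pretoken.strip().isalpha()


def aggregate(docs):
    """Count pretokens and supermerge-candidate runs across all documents."""
    pretoken_counts = Counter()
    candidate_counts = Counter()
    for doc in docs:
        run = []  # current run of consecutive eligible pretokens
        for pretoken in PRETOKEN_RE.findall(doc):
            pretoken_counts[pretoken] += 1
            if is_word(pretoken):
                run.append(pretoken)
                continue
            # An ineligible pretoken ends the run; keep runs of length >= 2.
            if len(run) >= 2:
                candidate_counts[tuple(run)] += 1
            run = []
        # Flush the run that reaches the end of the document.
        if len(run) >= 2:
            candidate_counts[tuple(run)] += 1
    return pretoken_counts, candidate_counts


docs = ["of the people, by the people", "of the people"]
pretokens, candidates = aggregate(docs)
for run, count in candidates.most_common():
    print(count, run)
```

In this toy input the run `("of", " the", " people")` occurs in both documents, so it is stored once with a count of 2. Any later merge step would then touch each distinct run once, weighted by its count, rather than rescanning every document.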
