
TL;DR
BlockBPE is a GPU-optimized, parallel byte-pair encoding implementation that significantly accelerates tokenization in large language model pipelines, especially for batch inference, with minimal quality loss.
Contribution
It introduces a novel parallel GPU algorithm for BPE tokenization that removes regex pre-tokenization, achieving near linear-time complexity and higher throughput.
Findings
Up to 2x higher throughput than tiktoken
Up to 2.5x higher throughput than HuggingFace Tokenizers
Maintains similar generation quality with minimal loss
Abstract
Tokenization is a critical preprocessing step in large language model pipelines, yet widely used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput batch inference. Unlike existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI's tiktoken, whose runtimes are dominated by Regex pre-tokenization and exhibit O(n log n) runtime, BlockBPE eliminates the Regex pre-tokenization. This leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to O(nd) where d ≪ n. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over…
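The round-based merging described in the abstract can be illustrated with a small CPU sketch. This is not the authors' GPU kernel; it is a sequential Python simulation, under stated assumptions, of what one thread block might do to one input string: look up every adjacent pair, pick the best-ranked merge, apply it everywhere it occurs, compact, and repeat. The function name `blockwise_bpe`, the toy merge table, and the convention that a merge's rank equals its new token id are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def blockwise_bpe(data: bytes, merges: Dict[Pair, int]) -> Tuple[List[int], int]:
    """Tokenize raw bytes with no regex pre-tokenization.

    `merges` maps an adjacent token pair to its merged token id. As an
    assumption of this sketch, a lower id means an earlier (higher
    priority) merge. Returns the final tokens and the number of merge
    rounds that were needed.
    """
    tokens = list(data)  # start from byte-level tokens 0..255
    rounds = 0
    while len(tokens) > 1:
        # Step 1: one lookup per adjacent pair. On a GPU each position
        # could be handled by its own thread; here it is a plain loop.
        pair_ids = [merges.get((a, b)) for a, b in zip(tokens, tokens[1:])]
        candidates = [p for p in pair_ids if p is not None]
        if not candidates:
            break  # no mergeable pair is left
        # Step 2: choose the best-ranked merge (a min-reduction on a GPU).
        best = min(candidates)
        # Step 3: apply that merge left to right, skipping overlapping
        # matches, and compact the sequence (a prefix sum on a GPU).
        merged: List[int] = []
        i = 0
        while i < len(tokens):
            if i < len(pair_ids) and pair_ids[i] == best:
                merged.append(best)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        rounds += 1
    return tokens, rounds


# Hypothetical toy merge table: "ab" -> 256, then (256, "c") -> 257.
toy_merges = {(ord("a"), ord("b")): 256, (256, ord("c")): 257}
tokens, rounds = blockwise_bpe(b"abcabc", toy_merges)
print(tokens, rounds)
```

In this sketch each round costs one pass over the current sequence, so total work grows with the sequence length times the number of rounds. The rounds themselves are sequential, but the work inside a round is independent per position, which is the part a thread block could parallelize.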
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Distributed and Parallel Computing Systems
