BlockBPE: Parallel BPE Tokenization

Amos You

arXiv:2507.11941·cs.CL·July 17, 2025

Open Access

TL;DR

BlockBPE is a GPU-optimized, parallel byte-pair encoding implementation that significantly accelerates tokenization in large language model pipelines, especially for batch inference, with minimal quality loss.

Contribution

It introduces a novel parallel GPU algorithm for BPE tokenization that removes regex pre-tokenization, achieving near linear-time complexity and higher throughput.
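Why dropping regex pre-tokenization can change the output at all is easiest to see on a toy case. The sketch below is illustrative only: the merge table `RANKS` and the simplified pattern `PRETOKENIZE` are hypothetical stand-ins, not the vocabulary or regex of any real tokenizer, and the plain CPU merge loop is not the paper's GPU algorithm. Pre-tokenization splits text into chunks that merges cannot cross, so removing it lets a merge fire across a boundary the regex would have enforced.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy merge table (lower rank = higher priority).
RANKS: Dict[Tuple[str, str], int] = {
    (" ", " "): 0,
    ("t", "h"): 1,
    ("th", "e"): 2,
    (" ", "the"): 3,
}

# Simplified stand-in for a GPT-2-style pre-tokenization pattern (assumption).
# Characters it does not match (e.g. punctuation) are dropped in this toy.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")


def bpe(symbols: List[str]) -> List[str]:
    """Plain BPE: repeatedly merge the lowest-rank (leftmost on ties) pair."""
    symbols = list(symbols)
    while True:
        candidates = [
            (RANKS[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in RANKS
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


def encode(text: str, pretokenize: bool) -> List[str]:
    """Run BPE per regex chunk, or over the whole text when pretokenize is False."""
    chunks = PRETOKENIZE.findall(text) if pretokenize else [text]
    return [tok for chunk in chunks for tok in bpe(list(chunk))]


with_regex = encode("a  the", pretokenize=True)
without_regex = encode("a  the", pretokenize=False)
```

With the regex, the second space stays attached to the word and the result is `["a", " ", " the"]`. Without it, the two spaces merge first and the result is `["a", "  ", "the"]`. Divergences of this kind are one plausible reason a model trained on regex-split tokens could see a small quality change.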

Findings

01. Up to 2x higher throughput than tiktoken

02. Up to 2.5x higher throughput than HuggingFace Tokenizers

03. Maintains similar generation quality with minimal loss

Abstract

Tokenization is a critical preprocessing step in large language model pipelines, yet widely used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput, batch inference. Unlike existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI's tiktoken, whose runtimes are dominated by regex pre-tokenization and exhibit $O(n \log n)$ runtime, BlockBPE eliminates regex pre-tokenization. This leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to $O(nd)$, where $d \ll n$. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over…
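The $O(nd)$ bound can be made concrete with a round-based merge loop. The following is a minimal sequential sketch, not the paper's CUDA kernel: it assumes each pass does three things that a thread block could do in parallel (look up the rank of every adjacent pair, take a block-wide minimum, then merge and compact), and it reads $d$ as the number of passes, which is an assumption since the excerpt does not define $d$. The function name `round_based_bpe` and the toy merge table are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def round_based_bpe(tokens: List[str], ranks: Dict[Pair, int]) -> Tuple[List[str], int]:
    """Merge in passes; return the final tokens and the number of passes used."""
    tokens = list(tokens)
    passes = 0
    inf = float("inf")
    while len(tokens) > 1:
        # Step 1: rank of every adjacent pair (one lookup per position;
        # on a GPU this could be one thread per position).
        pair_ranks = [ranks.get((a, b), inf) for a, b in zip(tokens, tokens[1:])]
        # Step 2: lowest rank in the sequence (a min reduction).
        best = min(pair_ranks)
        if best == inf:
            break  # no mergeable pair remains
        # Step 3: merge every non-overlapping occurrence left to right and
        # compact the sequence (a prefix sum could do this in parallel).
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and pair_ranks[i] == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        passes += 1
    return tokens, passes


# Hypothetical toy merge table; lower rank = higher priority.
toy_ranks = {("a", "a"): 0, ("a", "b"): 1, ("aa", "ab"): 2}
result, num_passes = round_based_bpe(list("aaabdaaabac"), toy_ranks)
```

On this input of $n = 11$ characters the loop finishes in three merge passes and yields `["aaab", "d", "aaab", "a", "c"]`. Each pass touches at most $n$ positions, so total work is at most $n$ times the number of passes, which matches the shape of the $O(nd)$ bound under the reading of $d$ assumed here.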

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Distributed and Parallel Computing Systems