Bit-level BPE: Below the byte boundary

Sangwhan Moon; Tatsuya Hiraoka; Naoaki Okazaki

arXiv:2506.07541 · cs.CL · June 10, 2025

TL;DR

This paper introduces Bit-level BPE, a lossless compression method for byte-level tokenization that reduces sequence length and computational cost in language models handling CJK and emoji characters.

Contribution

It presents a novel bit-level BPE technique that compresses byte sequences losslessly, improving efficiency in subword tokenization for diverse languages.

Findings

01 Reduces sequence length in CJK and emoji tokenization

02 Maintains lossless compression of byte sequences

03 Enhances computational efficiency during training and inference

Abstract

Byte-level fallbacks for subword tokenization have become common practice in large language models. In particular, they have been shown to be highly effective as a pragmatic solution for preventing out-of-vocabulary (OOV) tokens, especially in larger models. However, breaking a character down into individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and in other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
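
The sequence-length cost described in the abstract can be seen with a small sketch. It is an illustration only, not the paper's method: it assumes that every character misses the subword vocabulary and falls back to one token per UTF-8 byte, and the helper name `byte_fallback_length` and the sample strings are hypothetical.

```python
def byte_fallback_length(text: str) -> int:
    """Token count if every character falls back to its UTF-8 bytes.

    Assumption for illustration: one token per byte, no subword merges.
    """
    return len(text.encode("utf-8"))


# Hypothetical sample strings, chosen only to cover different scripts.
samples = {
    "ascii": "token",
    "chinese": "語言模型",
    "japanese": "トークン",
    "korean": "토크나이저",
    "emoji": "🙂🚀",
}

for name, text in samples.items():
    n_chars = len(text)                     # Unicode code points
    n_bytes = byte_fallback_length(text)    # byte-fallback tokens
    print(f"{name:9s} chars={n_chars} bytes={n_bytes} ratio={n_bytes / n_chars:.1f}")
```

Under this assumption, ASCII text costs one token per character, the CJK samples cost three tokens per character, and the emoji samples cost four. That gap is the overhead a lossless compression of the byte sequence aims to reduce.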

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Topic Modeling