Bit-level BPE: Below the byte boundary
Sangwhan Moon, Tatsuya Hiraoka, Naoaki Okazaki

TL;DR
This paper introduces Bit-level BPE, a lossless compression method for byte-level tokenization that reduces sequence length and computational cost in language models handling CJK and emoji characters.
Contribution
The paper presents a bit-level BPE technique that losslessly compresses byte sequences, reducing sequence length in subword tokenization for character-diverse text such as CJK and emoji.
Findings
Reduces sequence length in CJK and emoji tokenization
Maintains lossless compression of byte sequences
Enhances computational efficiency during training and inference
Abstract
Byte-level fallbacks for subword tokenization have become a common practice in large language models. In particular, they have been demonstrated to be incredibly effective as a pragmatic solution for preventing out-of-vocabulary (OOV) tokens, especially in the context of larger models. However, breaking a character down into individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and in other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
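To make the sequence-length problem concrete, here is a minimal sketch of how many tokens one character costs when a tokenizer falls back to raw UTF-8 bytes. It illustrates only the motivation stated in the abstract, not the paper's compression method. The helper name `byte_fallback_length` and the sample characters are hypothetical, chosen for illustration, and the sketch assumes one token per byte under fallback.

```python
def byte_fallback_length(text: str) -> int:
    """Count tokens if every character falls back to its UTF-8 bytes.

    Assumption: the tokenizer emits exactly one token per byte when a
    character is not covered by the subword vocabulary.
    """
    return len(text.encode("utf-8"))


# One sample character per script (illustrative choices, not from the paper).
samples = {
    "latin": "a",
    "chinese": "語",
    "japanese": "あ",
    "korean": "한",
    "emoji": "😀",
}

lengths = {name: byte_fallback_length(ch) for name, ch in samples.items()}

for name, n in lengths.items():
    print(f"{name}: 1 character -> {n} byte token(s)")
```

Under that assumption, a single CJK character in the Basic Multilingual Plane costs three tokens and a typical emoji costs four, compared with one token for an ASCII letter. This is the overhead a lossless compression step would aim to reduce.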
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Topic Modeling
