Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Hui Chen, Peng Liu, Jungong Han, Guiguang Ding

TL;DR
Scaffold-BPE improves text tokenization for large language models by removing infrequent scaffold tokens, reducing frequency imbalance, and enhancing model training and performance.
Contribution
It introduces a simple, parameter-free scaffold token removal mechanism to the BPE algorithm, addressing frequency imbalance issues in tokenization.
Findings
Outperforms original BPE in language modeling tasks
Enhances machine translation performance
Mitigates frequency imbalance in token representations
Abstract
Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus to generate a new token and keeps all generated tokens in the vocabulary, it unavoidably holds tokens that primarily act as components of a longer token and appear infrequently on their own. We term such tokens as Scaffold Tokens. Due to their infrequent occurrences in the text corpus, Scaffold Tokens pose a learning imbalance issue. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original…
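The frequency imbalance described in the abstract can be reproduced on a toy corpus. The sketch below is a minimal, word-level version of the original BPE merge loop, not the Scaffold-BPE removal mechanism itself (whose details fall in the truncated part of the abstract). The function names `train_bpe` and `token_frequencies`, the tie-breaking rule, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Plain BPE sketch: repeatedly merge the most frequent adjacent token pair."""
    # Each word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(w) for w in words)
    vocab = []  # merged tokens, in the order they were created
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Assumed tie-break: highest count first, then lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merged = best[0] + best[1]
        vocab.append(merged)  # original BPE keeps every generated token
        new_corpus = Counter()
        for seq, freq in corpus.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, corpus


def token_frequencies(corpus):
    """Count how often each token appears in the tokenized corpus."""
    counts = Counter()
    for seq, freq in corpus.items():
        for tok in seq:
            counts[tok] += freq
    return counts


# Hypothetical toy corpus: "zone" is common, "zoo" is rare.
words = ["zone"] * 9 + ["zoo"] * 1
vocab, corpus = train_bpe(words, num_merges=3)
freq = token_frequencies(corpus)

print(vocab)                                   # ['zo', 'ne', 'zone']
print(freq["zone"], freq["zo"], freq["ne"])    # 9 1 0
```

Here "zo" is the most frequent pair when it is merged (10 occurrences), yet after "zone" is formed it appears on its own only once, and "ne" never appears on its own at all. Both remain in the vocabulary under the original algorithm, which is the behaviour the abstract attributes to Scaffold Tokens.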
Taxonomy
Topics: Advanced Neural Network Applications · Algorithms and Data Compression · Advanced Data Storage Technologies
Methods: Byte Pair Encoding
