LBPE: Long-token-first Tokenization to Improve Large Language Models
Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding

TL;DR
LBPE introduces a novel tokenization method that prioritizes long tokens during encoding, reducing learning imbalance in LLMs and improving performance over traditional BPE across various tasks.
Contribution
LBPE proposes a new tokenization approach that prioritizes long tokens based on reverse length rank, addressing the imbalance issue in token frequency and enhancing LLM training.
Findings
LBPE outperforms BPE in multiple language modeling tasks.
LBPE reduces frequency disparity between short and long tokens.
LBPE improves overall model performance and learning stability.
Abstract
The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids out-of-vocabulary issues. Despite its success, a critical challenge persists: long tokens, rich in semantic information, occur less often in tokenized datasets than short tokens, which can lead to imbalanced learning across tokens. To address this, we propose LBPE, which prioritizes long tokens during the encoding process. LBPE generates tokens according to their reverse ranks of token length rather than their ranks in the vocabulary, granting longer tokens higher priority during the encoding process. Consequently, LBPE smooths the frequency differences between short and long tokens, and thus mitigates the learning imbalance. Extensive experiments across diverse language modeling tasks demonstrate that LBPE…
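The encoding rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified encoder that repeatedly merges adjacent symbols whose concatenation is a known token, and it contrasts choosing the merge by vocabulary rank (BPE-style) with choosing the merge that yields the longest token (long-token-first). The function `encode_word`, the `long_first` flag, and the toy vocabulary `TOY_RANK` are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary: merged token -> rank (lower = learned earlier).
# Single characters are assumed to always be valid tokens.
TOY_RANK: Dict[str, int] = {"ab": 0, "cd": 1, "abc": 2}


def encode_word(word: str, rank: Dict[str, int], long_first: bool) -> List[str]:
    """Merge adjacent symbols until no adjacent pair forms a known token.

    long_first=False: pick the candidate with the lowest vocabulary rank.
    long_first=True:  pick the candidate that forms the longest token,
                      breaking ties by vocabulary rank (an assumption).
    """
    symbols = list(word)

    def by_rank(cand: Tuple[int, str]) -> Tuple[int, int]:
        pos, token = cand
        return (rank[token], pos)

    def by_length(cand: Tuple[int, str]) -> Tuple[int, int, int]:
        pos, token = cand
        return (-len(token), rank[token], pos)

    key = by_length if long_first else by_rank

    while len(symbols) > 1:
        # Candidate merges: adjacent pairs whose concatenation is in the vocabulary.
        candidates = [
            (i, symbols[i] + symbols[i + 1])
            for i in range(len(symbols) - 1)
            if symbols[i] + symbols[i + 1] in rank
        ]
        if not candidates:
            break
        pos, merged = min(candidates, key=key)
        symbols[pos:pos + 2] = [merged]
    return symbols


if __name__ == "__main__":
    print(encode_word("abcd", TOY_RANK, long_first=False))  # ['ab', 'cd']
    print(encode_word("abcd", TOY_RANK, long_first=True))   # ['abc', 'd']
```

With this toy vocabulary, the rank-ordered encoder merges `cd` before it can form `abc`, while the long-token-first encoder produces the longer token `abc`. On a real corpus, this kind of shift is what would raise the occurrence counts of long tokens relative to short ones.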
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Byte Pair Encoding
