LBPE: Long-token-first Tokenization to Improve Large Language Models

Haoran Lian; Yizhe Xiong; Zijia Lin; Jianwei Niu; Shasha Mo; Hui Chen; Peng Liu; Guiguang Ding

arXiv:2411.05504·cs.CL·November 11, 2024

TL;DR

LBPE introduces a novel tokenization method that prioritizes long tokens during encoding, reducing learning imbalance in LLMs and improving performance over traditional BPE across various tasks.

Contribution

LBPE proposes a new tokenization approach that prioritizes long tokens based on reverse length rank, addressing the imbalance issue in token frequency and enhancing LLM training.

Findings

01. LBPE outperforms BPE in multiple language modeling tasks.

02. LBPE reduces frequency disparity between short and long tokens.

03. LBPE improves overall model performance and learning stability.

Abstract

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids out-of-vocabulary issues. Despite its success, a critical challenge persists: long tokens, which are rich in semantic information, occur less often in tokenized datasets than short tokens, which can result in imbalanced learning across different tokens. To address this, we propose LBPE, which prioritizes long tokens during the encoding process. LBPE generates tokens according to their reverse ranks of token length rather than their ranks in the vocabulary, granting longer tokens higher priority during the encoding process. Consequently, LBPE smooths the frequency differences between short and long tokens, and thus mitigates the learning imbalance. Extensive experiments across diverse language modeling tasks demonstrate that LBPE…
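The difference between rank-first and long-token-first encoding described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `encode` function, the `ranks` vocabulary, the rule that any adjacent pair whose concatenation is in the vocabulary may be merged, and the tie-breaking order are all assumptions made for illustration, and the paper's actual procedure may differ in its details.

```python
def encode(word, vocab_rank, long_first):
    """Greedily merge adjacent symbols whose concatenation is in vocab_rank.

    long_first=False: pick the candidate with the lowest vocabulary rank
                      (BPE-style priority).
    long_first=True:  pick the longest candidate first (the LBPE idea);
                      ties fall back to vocabulary rank, then to the
                      leftmost position (an assumed tie-break).
    """
    symbols = list(word)
    while len(symbols) > 1:
        # Every adjacent pair whose concatenation is a known token.
        candidates = [
            (i, symbols[i] + symbols[i + 1])
            for i in range(len(symbols) - 1)
            if symbols[i] + symbols[i + 1] in vocab_rank
        ]
        if not candidates:
            break
        if long_first:
            key = lambda c: (-len(c[1]), vocab_rank[c[1]], c[0])
        else:
            key = lambda c: (vocab_rank[c[1]], c[0])
        i, merged = min(candidates, key=key)
        symbols[i:i + 2] = [merged]
    return symbols


# Hypothetical toy vocabulary: a lower rank means the token was learned earlier.
ranks = {"cd": 0, "ab": 1, "bcd": 2}

print(encode("abcd", ranks, long_first=False))  # ['ab', 'cd']
print(encode("abcd", ranks, long_first=True))   # ['a', 'bcd']
```

In this toy case the rank-first encoder never emits the long token `bcd`, because the lower-ranked `ab` consumes the `b` first, while the long-token-first encoder does emit it. This is the kind of shift in token frequencies that the abstract attributes to LBPE.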

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Byte Pair Encoding