Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Jinbiao Yang

arXiv:2403.00417 · cs.CL · March 4, 2024 · 1 citation

TL;DR

This paper introduces the LiB tokenizer, inspired by cognitive science, which learns an integrated vocabulary to improve language model performance by reducing token complexity and handling multiword expressions more effectively.

Contribution

The paper proposes the LiB model, which autonomously learns an integrated vocabulary combining subwords, words, and multiword expressions (MWEs), inspired by principles of human language processing.

Findings

01

The LiB tokenizer outperforms existing tokenizers in evaluations.

02

The approach reduces token and type counts effectively.

03

Cognitive science principles can guide tokenizer design.

Abstract

Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word tokenizers, they have difficulty handling non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers, more than mere technical tools, should draw inspiration from cognitive science research on human language processing. This study then introduces the "Principle of Least Effort" from cognitive science, which holds that humans naturally seek to reduce cognitive effort, and discusses the benefits of this principle for tokenizer…
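
The token/type balance mentioned in the abstract can be illustrated with a small count. The sketch below is not the paper's method or the LiB tokenizer; it is a minimal illustration, assuming whitespace splitting as a stand-in for a word-level tokenizer and single characters as the finest-grained alternative. The sample sentence and the helper name `token_type_counts` are hypothetical.

```python
def token_type_counts(tokens):
    """Return (number of tokens, number of distinct types) for a token list."""
    return len(tokens), len(set(tokens))


# Hypothetical sample text, chosen only for illustration.
text = "the cat sat on the mat the cat ran"

# Word-level segmentation: fewer tokens, but each new word adds a new type.
word_tokens = text.split()

# Character-level segmentation (spaces dropped): few types, many tokens.
char_tokens = [c for c in text if c != " "]

word_counts = token_type_counts(word_tokens)
char_counts = token_type_counts(char_tokens)

print("word-level (tokens, types):", word_counts)
print("char-level (tokens, types):", char_counts)
```

On this toy input, the word-level split yields a shorter sequence with a vocabulary that grows with every unseen word, while the character-level split yields a tiny vocabulary at the cost of a much longer sequence. Subword tokenizers such as BPE sit between these two extremes.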

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Byte Pair Encoding