Tokenization Is More Than Compression

Craig W. Schmidt; Varshini Reddy; Haoran Zhang; Alec Alameddine; Omri Uzan; Yuval Pinter; Chris Tanner

arXiv:2402.18376 · cs.CL · October 8, 2024 · 1 citation

TL;DR

This paper investigates the factors behind effective tokenization in NLP, challenging the idea that fewer tokens always improve performance, and highlights the importance of pre-tokenization and BPE initialization.

Contribution

The paper introduces PathPiece, a tokenizer that segments text into the minimum number of tokens for a given vocabulary, and analyzes how tokenization design choices affect language model performance.
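
The idea of minimum-token segmentation can be illustrated with a small dynamic program. The sketch below is not the authors' implementation: it works on characters, breaks ties arbitrarily, and the function names, the `max_len` parameter, and the toy vocabulary are all hypothetical. It also includes a greedy longest-match baseline to show that greedy matching does not always reach the minimum.

```python
from typing import List, Set


def min_token_segmentation(text: str, vocab: Set[str], max_len: int = 16) -> List[str]:
    """Split text into the fewest tokens drawn from vocab (illustrative sketch)."""
    n = len(text)
    inf = float("inf")
    best = [0] + [inf] * n  # best[i] = fewest tokens that cover text[:i]
    back = [0] * (n + 1)    # back[i] = start index of the last token ending at i
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and text[start:end] in vocab:
                best[end] = best[start] + 1
                back[end] = start
    if best[n] == inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


def greedy_longest_match(text: str, vocab: Set[str], max_len: int = 16) -> List[str]:
    """Baseline: repeatedly take the longest vocabulary entry at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError("no vocabulary entry matches at position %d" % i)
    return tokens


# Toy vocabulary (hypothetical): single characters plus a few merges.
toy_vocab = {"a", "b", "c", "d", "e", "ab", "abc", "cde"}
print(min_token_segmentation("abcde", toy_vocab))  # ['ab', 'cde']
print(greedy_longest_match("abcde", toy_vocab))    # ['abc', 'd', 'e']
```

On this toy input the dynamic program uses two tokens where greedy matching uses three. The paper's finding is that this kind of token-count reduction does not by itself translate into better downstream performance.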

Findings

1. Fewer tokens do not necessarily lead to better downstream performance.
2. Pre-tokenization significantly impacts tokenizer effectiveness.
3. Using BPE to initialize the vocabulary offers notable benefits.

Abstract

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer tokens lead to better downstream performance by introducing PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary. Through extensive experimentation we find this hypothesis not to be the case, casting doubt on the understanding of the reasons for effective tokenization. To examine which other factors play a role, we evaluate design decisions across all three phases of tokenization: pre-tokenization, vocabulary construction, and…

Taxonomy

Topics: Embedded Systems Design Techniques

Methods: Byte Pair Encoding