Beyond Text Compression: Evaluating Tokenizers Across Scales

Jonas F. Lotz; António V. Lopes; Stephan Peitz; Hendra Setiawan; Leonardo Emili

arXiv:2506.03101·cs.CL·June 4, 2025

TL;DR

This paper introduces an efficient framework for evaluating tokenizers across languages and model scales. It finds that tokenizer choice matters little for English tasks but consistently affects multilingual ones, and it proposes new intrinsic metrics for assessing tokenizers.

Contribution

The paper presents an intrinsic evaluation framework for tokenizers that combines several metrics, including new ones inspired by Zipf's law, and uses scaling consistency to show that smaller models can stand in for larger ones when comparing tokenizers for multilingual language models.

Findings

01

Tokenizer choice has a negligible effect on English tasks but produces consistent performance differences in multilingual settings.

02

Smaller models accurately predict significant differences in tokenizer impact on larger models, at a fraction of the compute cost.

03

The proposed Zipf-inspired intrinsic metrics correlate more strongly with downstream performance than text compression does when modeling unseen languages.
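
One way to picture the scaling-consistency finding is as a check on whether a small model and a large model order the same tokenizers the same way. The sketch below is an illustration only, not the paper's procedure: the function name pairwise_agreement and the score dictionaries are hypothetical, and it assumes each tokenizer has a single downstream score per model size where higher is better.

```python
from itertools import combinations


def pairwise_agreement(small_scores, large_scores):
    """Fraction of tokenizer pairs that both model sizes rank the same way.

    Both arguments map a tokenizer name to one downstream score
    (hypothetical inputs; higher is assumed to be better).
    """
    names = sorted(set(small_scores) & set(large_scores))
    pairs = list(combinations(names, 2))
    if not pairs:
        raise ValueError("need at least two tokenizers scored at both scales")

    agree = 0
    for a, b in pairs:
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        # Same sign means both scales prefer the same tokenizer of the pair;
        # a tie at both scales also counts as agreement.
        if small_diff * large_diff > 0 or (small_diff == 0 and large_diff == 0):
            agree += 1
    return agree / len(pairs)
```

A value near 1.0 would mean the cheaper model is a usable proxy for comparing tokenizers. A value near 0.5 would mean its ranking carries little information about the larger model.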

Abstract

The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf's law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our…
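
The abstract contrasts text compression with metrics inspired by Zipf's law but does not define either here. As a rough illustration of the two families, the sketch below computes a simple compression measure (characters per token) and the least-squares slope of log frequency against log rank over a token stream. Both functions, their names, and the tokenize callable are assumptions made for this example. They are generic textbook quantities, not the metrics the authors propose.

```python
import math
from collections import Counter


def chars_per_token(texts, tokenize):
    """Average number of characters covered by one token (a compression proxy).

    `tokenize` is any callable that turns a string into a list of tokens
    (a hypothetical stand-in for a real tokenizer).
    """
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A token distribution that follows Zipf's law exactly
    (frequency proportional to 1/rank) gives a slope of -1.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

Comparing how far each tokenizer's slope sits from -1 on the same corpus is one plausible way to use such a quantity alongside a compression measure. Whether that matches the paper's definition is not something this page confirms.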

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning